FIELD OF THE INVENTION



[0002] The present invention relates to an apparatus and method for quantitative protein design and optimization. 
BACKGROUND OF THE INVENTION

[0003] De novo protein design has received considerable attention recently, and significant advances have been
made toward the goal of producing stable, well-folded proteins with novel sequences. Efforts to design proteins rely
on knowledge of the physical properties that determine protein structure, such as the patterns of hydrophobic and
hydrophilic residues in the sequence, salt bridges and hydrogen bonds, and secondary structural preferences of amino
acids. For example, the construction of α-helical and β-sheet proteins with native-like sequences was attempted by
individually selecting the residue required at every position in the target fold ([...] 8747-8751 (1994)). Alternately, a
minimalist approach was used to design helical proteins, where the simplest [...] (1988); DeGrado, et al., Science
243:622-628 (1989); Handel, et al., [...]), with varying degrees of success. An experimental method that relies on the
hydrophobic and polar (HP) pattern of a sequence was developed, in which a library of sequences with the correct
pattern for a four helix bundle was generated by random mutagenesis ([...] 262:1680-1685 (1993)). Among non de novo
approaches, domains of naturally occurring proteins have been modified or coupled together to achieve a desired
tertiary organization ([...] Nature 362:367-369 (1993); Pomerantz, et al., Science 267:93-96 (1995)).

Though the correct secondary structure and overall tertiary organization seem to have been attained by several
of the above techniques, many designed proteins appear to lack the structural specificity of native proteins. The [...]

Several groups have applied and experimentally tested systematic, quantitative methods to protein design
with the goal of developing general design algorithms (Hellinga, et al., J. Mol. Biol. 222:763-785 (1991); Hurley, et al.,
[...] 224:1143-1154 (1992); Desjarlais, et al., Protein Science [...]; [...] Acad. Sci. USA 92:8408-8412 (1995); Klemba,
et al., Nat. Struc. Biol. [...]; [...] 34:11645-11651 (1995); Betz, et al., Biochemistry [...]).

[...] for each of which it establishes a group of potential rotamers, wherein at least one variable residue position has
rotamers from at least two different amino acid side chains. The method of D1 then [...]

[...] sometimes [...] solvation potentials scoring functions.
[0007] In addition, the qualitative nature of many design approaches has hampered the development of improved,
second generation proteins because there are no objective methods for learning from past design successes and failures.

[0008] It is an object of the invention to provide computational protein design and optimization via an objective,
quantitative design technique implemented in connection with a general purpose computer.

SUMMARY OF THE INVENTION 

[0009] The invention is set out in appended method claim 1 and computer medium claim 15.
BRIEF DESCRIPTION OF THE DRAWINGS 
[0010] Figure 1 illustrates a general purpose computer configured in accordance with an embodiment of the invention.
[0011] Figure 2 illustrates processing steps associated with an embodiment of the invention.
[0012] Figure 3 illustrates processing steps associated with a ranking module used in accordance with an embodi- 
ment of the invention. After any DEE step, any one of the previous DEE steps may be repeated. In addition, any one 
of the DEE steps may be eliminated; for example, original singles DEE (step 74) need not be run. 
[0013] Figure 4 depicts the protein design automation cycle. 

[0014] Figure 5 depicts the helical wheel diagram of a coiled coil. One heptad repeat is shown viewed down the
major axes of the helices. The a and d positions define the solvent-inaccessible core of the molecule (Cohen & Parry, 
1990, Proteins, Structure, Function and Genetics 7:1-15). 

[0015] Figures 6A and 6B depict the comparison of simulation cost functions to experimental Tm's. Figure 6A depicts
the initial cost function, which contains only a van der Waals term for the eight PDA peptides. Figure 6B depicts the
improved cost function containing polar and nonpolar surface area terms weighted by atomic solvation parameters
derived from QSAR analysis; 16 cal/mol/Å² favors hydrophobic surface burial.

[0016] Figure 7 shows the rank correlation of energy predicted by the simulation module versus the combined activity
score of λ repressor mutants (Lim, et al., J. Mol. Biol. 219:359-376 (1991); Hellinga, et al., Proc. Natl. Acad. Sci. USA
91:5803-5807 (1994)).

[0017] Figure 8 shows the sequence of pda8d aligned with the second zinc finger of Zif268. The boxed positions were
designed using the sequence selection algorithm. The coordinates of PDB record 1zaa (Paveletch, et al., Science 252:
809-817 (1991)) from residues 33-60 were used as the structure template. In our numbering, position 1 corresponds 
to 1zaa position 33. 

[0018] Figures 9A and 9B show the NMR spectra and solution secondary structure of pda8d from Example 3. Figure
9A is the TOCSY Hα-HN fingerprint region of pda8d. Figure 9B is the NMR NOE connectivities of pda8d. Bars represent
unambiguous connectivities and the bar thickness of the sequential connections is indexed to the intensity of the
resonance.

[0019] Figures 10A and 10B depict the secondary structure content and thermal stability of α90, α85, α70 and α107.
Figure 10A depicts the far UV spectra (circular dichroism). Figure 10B depicts the thermal denaturation monitored by 
CD. 

[0020] Figure 11 depicts the sequence of FSD-1 of Example 5 aligned with the second zinc finger of Zif268. The bar
at the top of the figure shows the residue position classifications: solid bars indicate core positions, hatched bars
indicate boundary positions and open bars indicate surface positions. The alignment matches positions of FSD-1 to
the corresponding backbone template positions of Zif268. Of the six identical positions (21%) between FSD-1 and
Zif268, four are buried (Ile7, Phe12, Leu18 and Ile22). The zinc binding residues of Zif268 are boxed. Representative
non-optimal sequence solutions determined using a Monte Carlo simulated annealing protocol are shown with their
rank. Vertical lines indicate identity with FSD-1. The symbols at the bottom of the figure show the degree of sequence
conservation for each residue position computed across the top 1000 sequences: filled circles indicate greater than
99% conservation, half-filled circles indicate conservation between 90 and 99%, open circles indicate conservation
between 50 and 90%, and the absence of a symbol indicates less than 50% conservation. The consensus sequence
determined by choosing the amino acid with the highest occurrence at each position is identical to the sequence of
FSD-1.

[0021] Figure 12 is a schematic representation of the minimum and maximum quantities (defined in Eqs. 24 to 27)
that are used to construct speed enhancements. The minima and maxima are utilized directly to find the $i_u j_v$ mb pair
and for the comparison of extrema. The differences between the quantities, denoted with arrows, are used to construct
the $q_{rs}$ and $q_{uv}$ metrics.

[0022] Figures 13A, 13B, 13C, 13D, 13E and 13F depict the areas involved in calculating the buried and exposed
areas of Equations 18 and 19. The dashed box is the protein template, the heavy solid lines correspond to three
rotamers at three different residue positions, and the lighter solid lines correspond to surface areas. a) $A^0_{i_r t3}$ for each
rotamer. b) $A_{i_r t}$ for each rotamer. c) $(A^0_{i_r t3} - A_{i_r t})$ summed over the three residues. The upper residue does not bury any
area against the template except that buried in the tri-peptide state $A^0_{i_r t3}$. d) $A_{i_r j_s t}$ for one pair of rotamers. e) The area
buried between rotamers, $(A_{i_r t} + A_{j_s t} - A_{i_r j_s t})$, for the same pair of rotamers as in (d). f) The area buried between rotamers,
$(A_{i_r t} + A_{j_s t} - A_{i_r j_s t})$, summed over the three pairs of rotamers. The area intersected by all three rotamers is counted twice
and is indicated by the double lines. The buried area calculated by Equation 18 is the area buried by the template,
represented in (c), plus s times the area buried between rotamers, represented in (f). The scaling factor s accounts for
the over-counting shown by the double lines in (f). The exposed area calculated by Equation 19 is the exposed area in
the presence of the template, represented in (b), minus s times the area buried between rotamers, represented in (f).

DETAILED DESCRIPTION OF THE INVENTION 



[0023] The present invention is directed to the quantitative design and optimization of amino acid sequences, using
an "inverse protein folding" approach, which seeks the optimal sequence for a desired structure. Inverse folding is
similar to protein design, which seeks to find a sequence or set of sequences that will fold into a desired structure.
These approaches can be contrasted with a "protein folding" approach, which attempts to predict a structure taken by
a given sequence.

[0024] The general preferred approach of the present invention is as follows, although alternate embodiments are
discussed below. A known protein structure is used as the starting point. The residues to be optimized are then
identified, which may be the entire sequence or subset(s) thereof. The side chains of any positions to be varied are then
removed. The resulting structure, consisting of the protein backbone and the remaining side chains, is called the template.
Each variable residue position is then preferably classified as a core residue, a surface residue, or a boundary residue;
each classification defines a subset of possible amino acid residues for the position (for example, core residues
generally will be selected from the set of hydrophobic residues, surface residues generally will be selected from the
hydrophilic residues, and boundary residues may be either). Each amino acid can be represented by a discrete set of all
allowed conformers of each side chain, called rotamers. Thus, to arrive at an optimal sequence for a backbone, all
possible sequences of rotamers must be screened, where each backbone position can be occupied either by each
amino acid in all its possible rotameric states, or a subset of amino acids, and thus a subset of rotamers.
[0025] Two sets of interactions are then calculated for each rotamer at every position: the interaction of the rotamer
side chain with all or part of the backbone (the "singles" energy, also called the rotamer/template or rotamer/backbone
energy), and the interaction of the rotamer side chain with all other possible rotamers at every other position or a subset
of the other positions (the "doubles" energy, also called the rotamer/rotamer energy). The energy of each of these
interactions is calculated through the use of a variety of scoring functions, which include the energy of van der Waals
forces, the energy of hydrogen bonding, the energy of secondary structure propensity, the energy of surface area
solvation and the electrostatics. Thus, the total energy of each rotamer interaction, both with the backbone and other
rotamers, is calculated and stored in a matrix form.
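
As a rough illustration of how the stored singles and doubles energies combine, the following Python sketch sums them for one rotamer sequence. The function name, the table layout and the numbers are assumptions made for this example only; they are not taken from the specification.

```python
from itertools import combinations

def sequence_energy(choice, singles, doubles):
    """Total energy of one rotamer sequence (illustrative layout).

    choice  -- choice[i] is the rotamer index selected at position i
    singles -- singles[i][r]: rotamer/template energy of rotamer r at position i
    doubles -- doubles[(i, j)][r][s]: rotamer/rotamer energy, stored for i < j
    """
    total = sum(singles[i][r] for i, r in enumerate(choice))
    for i, j in combinations(range(len(choice)), 2):
        total += doubles[(i, j)][choice[i]][choice[j]]
    return total

# Toy tables with made-up numbers: 3 positions, 2 rotamers per position.
singles = [[1.0, 0.5], [0.2, 0.9], [0.0, 0.3]]
doubles = {
    (0, 1): [[0.0, 1.0], [-0.5, 0.2]],
    (0, 2): [[0.1, 0.0], [0.0, -0.2]],
    (1, 2): [[0.3, 0.0], [0.0, 0.4]],
}
energy = sequence_energy([1, 0, 1], singles, doubles)
```

Because every term is a table lookup, scoring a candidate sequence costs one singles lookup per position plus one doubles lookup per position pair.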

[0026] The discrete nature of rotamer sets allows a simple calculation of the number of rotamer sequences to be
tested. A backbone of length n with m possible rotamers per position will have m^n possible rotamer sequences, a
number which grows exponentially with sequence length and renders the calculations either unwieldy or impossible in
real time. Accordingly, to solve this combinatorial search problem, a "Dead End Elimination" (DEE) calculation is
performed. The DEE calculation is based on the fact that if the worst total interaction of a first rotamer is still better than
the best total interaction of a second rotamer, then the second rotamer cannot be part of the global optimum solution.
Since the energies of all rotamers have already been calculated, the DEE approach only requires sums over the
sequence length to test and eliminate rotamers, which speeds up the calculations considerably. DEE can be rerun
comparing pairs of rotamers, or combinations of rotamers, which will eventually result in the determination of a single
sequence which represents the global optimum energy.
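
The elimination test described above can be sketched in Python as follows. This is a minimal sketch of the singles criterion only, under the assumption that the energies are held in nested lists and a dictionary keyed by position pair; the names `pair_energy` and `dee_singles_pass` and all numbers are hypothetical.

```python
from math import prod

def pair_energy(doubles, i, r, j, s):
    """Rotamer/rotamer energy, stored once per position pair with i < j."""
    return doubles[(i, j)][r][s] if i < j else doubles[(j, i)][s][r]

def dee_singles_pass(singles, doubles, alive):
    """One pass of the singles elimination test.

    alive[i] is the set of rotamer indices still allowed at position i.
    A rotamer is removed when its best possible total interaction is still
    worse (higher) than the worst possible total of another rotamer at the
    same position. Returns True if anything was eliminated.
    """
    n = len(singles)
    changed = False
    for i in range(n):
        others = [j for j in range(n) if j != i]
        best, worst = {}, {}
        for r in alive[i]:
            terms = [[pair_energy(doubles, i, r, j, s) for s in alive[j]]
                     for j in others]
            best[r] = singles[i][r] + sum(min(t) for t in terms)
            worst[r] = singles[i][r] + sum(max(t) for t in terms)
        lowest_worst = min(worst.values())
        dead = {r for r in alive[i] if best[r] > lowest_worst}
        if dead:
            alive[i] -= dead
            changed = True
    return changed

# Toy data with made-up numbers: 2 positions, 2 rotamers per position.
singles = [[0.0, 10.0], [0.0, 0.0]]
doubles = {(0, 1): [[0.0, 1.0], [0.0, 1.0]]}
alive = [{0, 1}, {0, 1}]
before = prod(len(a) for a in alive)   # m**n = 2**2 rotamer sequences
while dee_singles_pass(singles, doubles, alive):
    pass
after = prod(len(a) for a in alive)    # sequences left after elimination
```

Each pass only needs minima and maxima over the stored pair energies, which is why the cost scales with sums over the sequence length rather than with the m^n sequences themselves.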

[0027] Once the global solution has been found, a Monte Carlo search may be done to generate a rank-ordered list
of sequences in the neighborhood of the DEE solution. Starting at the DEE solution, random positions are changed to
other rotamers, and the new sequence energy is calculated. If the new sequence meets the criteria for acceptance, it
is used as a starting point for another jump. After a predetermined number of jumps, a rank-ordered list of sequences
is generated.
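
A minimal Python sketch of such a neighborhood search is given below. The passage above does not state the acceptance criteria, so the Metropolis rule, the temperature parameter and the function name `monte_carlo_neighbors` are assumptions made purely for illustration.

```python
import math
import random

def monte_carlo_neighbors(start, n_rotamers, energy_fn, jumps=1000,
                          temperature=1.0, seed=0):
    """Random walk from a starting rotamer sequence; returns the visited
    sequences ranked by energy (lowest first).

    start       -- starting sequence, e.g. the DEE solution
    n_rotamers  -- n_rotamers[i] is the number of rotamers at position i
    energy_fn   -- callable mapping a sequence (list of indices) to an energy
    """
    rng = random.Random(seed)
    current = list(start)
    current_e = energy_fn(current)
    seen = {tuple(current): current_e}
    for _ in range(jumps):
        trial = list(current)
        pos = rng.randrange(len(trial))           # pick a random position
        trial[pos] = rng.randrange(n_rotamers[pos])  # change it to some rotamer
        trial_e = energy_fn(trial)
        delta = trial_e - current_e
        # Assumed acceptance rule (Metropolis); the text leaves this open.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            seen[tuple(current)] = current_e
    return sorted(seen.items(), key=lambda item: item[1])

# Toy energy function (hypothetical): the sum of the rotamer indices.
ranked = monte_carlo_neighbors([0, 0, 0], [3, 3, 3], energy_fn=sum, jumps=200)
```

The returned list plays the role of the rank-ordered list of sequences near the starting solution.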

[0029] Thus, the present invention provides a computer-assisted method of designing a protein. The method comprises
providing a protein backbone structure with variable residue positions, and then establishing a group of potential
rotamers for each of the residue positions. As used herein, the backbone, or template, includes the backbone atoms
and any fixed side chains. The interactions between the protein backbone and the potential rotamers, and between
pairs of the potential rotamers, are then processed to generate a set of optimized protein sequences, preferably a
single global optimum, which then may be used to generate other related sequences.
[0030] Figure 1 illustrates an automated protein design apparatus 20 in accordance with an embodiment of the
invention. The apparatus 20 includes a central processing unit 22 which communicates with a memory 24 and a set
of input/output devices (e.g., keyboard, mouse, monitor, printer, etc.) 26 through a bus 28. The general interaction
between a central processing unit 22, a memory 24, input/output devices 26, and a bus 28 is known in the art. The
present invention is directed toward the automated protein design program 30 stored in the memory 24.

[0031] The automated protein design program 30 may be implemented with a side chain module 32. As discussed
in detail below, the side chain module establishes a group of potential rotamers for a selected protein backbone
structure. The protein design program 30 may also be implemented with a ranking module 34. As discussed in detail below,
the ranking module 34 analyzes the interaction of rotamers with the protein backbone structure to generate optimized
protein sequences. The protein design program 30 may also include a search module 36 to execute a search, for
example a Monte Carlo search as described below, in relation to the optimized protein sequences. Finally, an assessment
module 38 may also be used to assess physical parameters associated with the derived proteins, as discussed
further below.

EP 0 974 111 B1

[0032] The memory 24 also stores a protein backbone structure 40, which is downloaded by a user through the input/ 
output devices 26. The memory 24 also stores information on potential rotamers derived by the side chain module 32. 
In addition, the memory 24 stores protein sequences 44 generated by the ranking module 34. The protein sequences 
44 may be passed as output to the input/output devices 26. 

[0033] The operation of the automated protein design apparatus 20 is more fully appreciated with reference to Fig. 
2. Fig. 2 illustrates processing steps executed in accordance with the method of the invention. As described below, 
many of the processing steps are executed by the protein design program 30. The first processing step illustrated in 
Fig. 2 is to provide a protein backbone structure (step 50). As previously indicated, the protein backbone structure is 
downloaded through the input/output devices 26 using standard techniques. 

[0034] The protein backbone structure corresponds to a selected protein. By "protein" herein is meant at least two 
amino acids linked together by a peptide bond. As used herein, protein includes proteins, oligopeptides and peptides. 
The peptidyl group may comprise naturally occurring amino acids and peptide bonds, or synthetic peptidomimetic 
structures, i.e. "analogs", such as peptoids (see Simon et al., PNAS USA 89(20):9367 (1992)). The amino acids may
either be naturally occurring or non-naturally occurring; as will be appreciated by those in the art, any structure for which
a set of rotamers is known or can be generated can be used as an amino acid. The side chains may be in either the
(R) or the (S) configuration. In a preferred embodiment, the amino acids are in the (S) or L-configuration. 
[0035] The chosen protein may be any protein for which a three dimensional structure is known or can be generated; 
that is, for which there are three dimensional coordinates for each atom of the protein. Generally this can be determined 
using X-ray crystallographic techniques, NMR techniques, de novo modelling, homology modelling, etc. In general, if 
X-ray structures are used, structures at 2 Å resolution or better are preferred, but not required.
[0036] The proteins may be from any organism, including prokaryotes and eukaryotes, with enzymes from bacteria, 
fungi, extremophiles such as the archaebacteria, insects, fish, animals (particularly mammals and particularly human)
and birds all possible.

[0037] Suitable proteins include, but are not limited to, industrial and pharmaceutical proteins, including ligands, cell 
surface receptors, antigens, antibodies, cytokines, hormones, and enzymes. Suitable classes of enzymes include, but 
are not limited to, hydrolases such as proteases, carbohydrases, lipases; isomerases such as racemases, epimerases, 
tautomerases, or mutases; transferases, kinases, oxidoreductases, and phosphatases. Suitable enzymes are listed in
the Swiss-Prot enzyme database. 

[0038] Suitable protein backbones include, but are not limited to, all of those found in the protein data base compiled 
and serviced by the Brookhaven National Lab. 

[0039] Specifically included within "protein" are fragments and domains of known proteins, including functional do- 
mains such as enzymatic domains, binding domains, etc., and smaller fragments, such as turns, loops, etc. That is, 
portions of proteins may be used as well. 

[0040] Once the protein is chosen, the protein backbone structure is input into the computer. By "protein backbone
structure" or grammatical equivalents herein is meant the three dimensional coordinates that define the three
dimensional structure of a particular protein. The structures which comprise a protein backbone structure (of a naturally
occurring protein) are the nitrogen, the carbonyl carbon, the α-carbon, and the carbonyl oxygen, along with the direction
of the vector from the α-carbon to the β-carbon.

[0041] The protein backbone structure which is input into the computer can either include the coordinates for both 
the backbone and the amino acid side chains, or just the backbone, i.e. with the coordinates for the amino acid side 
chains removed. If the former is done, the side chain atoms of each amino acid of the protein structure may be "stripped" 
or removed from the structure of a protein, as is known in the art, leaving only the coordinates for the "backbone" atoms 
(the nitrogen, carbonyl carbon and oxygen, and the α-carbon, and the hydrogens attached to the nitrogen and
α-carbon).

[0042] After inputting the protein structure backbone, explicit hydrogens are added if not included within the structure
(for example, if the structure was generated by X-ray crystallography, hydrogens must be added). After hydrogen
addition, energy minimization of the structure is run, to relax the hydrogens as well as the other atoms, bond angles
and bond lengths. In a preferred embodiment, this is done by doing a number of steps of conjugate gradient minimization
(Mayo et al., J. Phys. Chem. 94:8897 (1990)) of atomic coordinate positions to minimize the Dreiding force field with
no electrostatics. Generally from about 10 to about 250 steps is preferred, with about 50 being most preferred.
[0043] The protein backbone structure contains at least one variable residue position. As is known in the art, the 
residues, or amino acids, of proteins are generally sequentially numbered starting with the N-terminus of the protein. 
Thus a protein having a methionine at its N-terminus is said to have a methionine at residue or amino acid position 1,
with the next residues as 2, 3, 4, etc. At each position, the wild type (i.e. naturally occurring) protein may have one of
at least 20 amino acids, in any number of rotamers. By "variable residue position" herein is meant an amino acid 
position of the protein to be designed that is not fixed in the design method as a specific residue or rotamer, generally 
the wild-type residue or rotamer. 
[0044] In a preferred embodiment, all of the residue positions of the protein are variable. That is, every amino acid
side chain may be altered in the methods of the present invention. This is particularly desirable for smaller proteins 
although the present methods allow the design of larger proteins as well. While there is no theoretical limit to the length 
of the protein which may be designed this way, there is a practical computational limit. 

[0045] In an alternate preferred embodiment, only some of the residue positions of the protein are variable, and the
remainder are "fixed", that is, they are identified in the three dimensional structure as being in a set conformation. In
some embodiments, a fixed position is left in its original conformation (which may or may not correlate to a specific
rotamer of the rotamer library being used). Alternatively, residues may be fixed as a non-wild type residue; for example,
when known site-directed mutagenesis techniques have shown that a particular residue is desirable (for example, to
eliminate a proteolytic site or alter the substrate specificity of an enzyme), the residue may be fixed as a particular
amino acid. Alternatively, the methods of the present invention may be used to evaluate mutations de novo, as is
discussed below. In an alternate preferred embodiment, a fixed position may be "floated"; the amino acid at that position
is fixed, but different rotamers of that amino acid are tested. In this embodiment, the variable residues may be at least
one, or anywhere from 0.1% to 99.9% of the total number of residues. Thus, for example, it may be possible to change
only a few (or one) residues, or most of the residues, with all possibilities in between.

[0046] In a preferred embodiment, residues which can be fixed include, but are not limited to, structurally or
biologically functional residues. For example, residues which are known to be important for biological activity, such as the
residues which form the active site of an enzyme, the substrate binding site of an enzyme, the binding site for a binding
partner (ligand/receptor, antigen/antibody, etc.), phosphorylation or glycosylation sites which are crucial to biological
function, or structurally important residues, such as disulfide bridges, metal binding sites, critical hydrogen bonding
residues, residues critical for backbone conformation such as proline or glycine, residues critical for packing
interactions, etc. may all be fixed in a conformation or as a single rotamer, or "floated".

[0047] Similarly, residues which may be chosen as variable residues may be those that confer undesirable biological
attributes, such as susceptibility to proteolytic degradation, dimerization or aggregation sites, glycosylation sites which
may lead to immune responses, unwanted binding activity, unwanted allostery, undesirable enzyme activity but with a
preservation of binding, etc.

[0048] As will be appreciated by those in the art, the methods of the present invention allow computational testing
of "site-directed mutagenesis" targets without actually making the mutants, or prior to making the mutants. That is,
quick analysis of sequences in which a small number of residues are changed can be done to evaluate whether a
proposed change is desirable. In addition, this may be done on a known protein, or on a protein optimized as described
herein.

[0049] As will be appreciated by those in the art, a domain of a larger protein may essentially be treated as a small 
independent protein; that is, a structural or functional domain of a large protein may have minimal interactions with the
remainder of the protein and may essentially be treated as if it were autonomous. In this embodiment, all or part of the
residues of the domain may be variable.
[0050] It should be noted that even if a position is chosen as a variable position, it is possible that the methods of 
the invention will optimize the sequence in such a way as to select the wild type residue at the variable position. This
generally occurs more frequently for core residues, and less regularly for surface residues. In addition, it is possible
to fix residues as non-wild type amino acids as well.
[0051] Once the protein backbone structure has been selected and input, and the variable residue positions chosen,
a group of potential rotamers for each of the variable residue positions is established. This operation is shown as step
52 in Fig. 2. This step may be implemented using the side chain module 32. In one embodiment of the invention,
the side chain module 32 includes at least one rotamer library, as described below, and program code that correlates
the selected protein backbone structure with corresponding information in the rotamer library. Alternatively, the side
chain module 32 may be omitted and the potential rotamers 42 for the selected protein backbone structure may be
downloaded through the input/output devices 26.

[0052] As is known in the art, each amino acid side chain has a set of possible conformers, called rotamers. See
Ponder, et al., Acad. Press Inc. (London) Ltd. pp. 775-791 (1987); Dunbrack, et al., Struc. Biol. 1(5):334-340 (1994);
Desmet, et al. Thus, a set of discrete rotamers for every amino acid side chain is used. There are two general types of
rotamer libraries: backbone dependent and backbone independent. A backbone dependent rotamer library allows different
rotamers depending on the position of the residue in the backbone; thus, for example, certain leucine rotamers are
allowed if the position is within an α-helix, and different leucine rotamers are allowed if the position is not in an α-helix.
A backbone independent rotamer library utilizes all rotamers of an amino acid at every position. In general, a backbone
independent library is preferred in the consideration of core residues, since flexibility in the core is important. However,
backbone independent libraries are computationally more expensive, and thus for surface and boundary positions, a
backbone dependent library is preferred. However, either type of library can be used at any position.

[0053] In addition, a preferred embodiment does a type of "fine tuning" of the rotamer library by expanding the possible
χ (chi) angle values of the rotamers by plus and minus one standard deviation (or more) about the mean value, in order
to minimize possible errors that might arise from the discreteness of the library. This is particularly important for aromatic
residues, and fairly important for hydrophobic residues, due to the increased requirements for flexibility in the core and
the rigidity of aromatic rings; it is not as important for the other residues. Thus a preferred embodiment expands the
χ1 and χ2 angles for all amino acids except Met, Arg and Lys.
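
The expansion of χ angles by one standard deviation about the mean can be sketched as below. The function name `expand_rotamer`, the tuple layout and the example angles are assumptions for illustration; angle wrap-around beyond ±180° is deliberately not handled in this sketch.

```python
from itertools import product

def expand_rotamer(chi_means, chi_sds, expand=(True, True), n_sd=1.0):
    """Return the library rotamer plus variants whose chi angles are shifted
    by plus and minus n_sd standard deviations about the mean.

    chi_means -- mean chi angles of the rotamer, in degrees (chi1, chi2, ...)
    chi_sds   -- standard deviations of those angles, in degrees
    expand    -- which chi angles to expand; by default chi1 and chi2 only
    """
    options = []
    for k, (mean, sd) in enumerate(zip(chi_means, chi_sds)):
        if k < len(expand) and expand[k]:
            options.append((mean - n_sd * sd, mean, mean + n_sd * sd))
        else:
            options.append((mean,))   # angle left at its library value
    return [tuple(combo) for combo in product(*options)]

# Hypothetical rotamer with two chi angles (values are made up).
variants = expand_rotamer(chi_means=(-60.0, 180.0), chi_sds=(10.0, 12.0))
```

Expanding both χ1 and χ2 turns one library rotamer into nine candidates, which is one reason the expansion is applied selectively.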
[0054] To roughly illustrate the numbers of rotamers, in one version of the Dunbrack & Karplus backbone-dependent 
rotamer library, alanine has 1 rotamer, glycine has 1 rotamer, arginine has 55 rotamers, threonine has 9 rotamers, 
lysine has 57 rotamers, glutamic acid has 69 rotamers, asparagine has 54 rotamers, aspartic acid has 27 rotamers,
tryptophan has 54 rotamers, tyrosine has 36 rotamers, cysteine has 9 rotamers, glutamine has 69 rotamers, histidine 
has 54 rotamers, valine has 9 rotamers, isoleucine has 45 rotamers, leucine has 36 rotamers, methionine has 21 
rotamers, serine has 9 rotamers, and phenylalanine has 36 rotamers. 

[0055] In general, proline is not used, since it will rarely be chosen for any position, although it can be
included if desired. Similarly, a preferred embodiment omits cysteine as a consideration, only to avoid potential disulfide 
problems, although it can be included if desired. 

[0056] As will be appreciated by those in the art, other rotamer libraries with all dihedral angles staggered can be 
used or generated. 

[0057] In a preferred embodiment, at a minimum, at least one variable position has rotamers from at least two different
amino acid side chains; that is, a sequence is being optimized, rather than a structure. 

[0058] In a preferred embodiment, rotamers from all of the amino acids (or all of them except cysteine, glycine and 
proline) are used for each variable residue position; that is, the group or set of potential rotamers at each variable 
position is every possible rotamer of each amino acid. This is especially preferred when the number of variable positions 
is not high as this type of analysis can be computationally expensive. 

[0059] In a preferred embodiment, each variable position is classified as either a core, surface or boundary residue 
position, although in some cases, as explained below, the variable position may be set to glycine to minimize backbone 
strain. 

[0060] It should be understood that quantitative protein design or optimization studies prior to the present invention 
focused almost exclusively on core residues. The present invention, however, provides methods for designing proteins 
containing core, surface and boundary positions. Alternate embodiments utilize methods for designing proteins con- 
taining core and surface residues, core and boundary residues, and surface and boundary residues, as well as core 
residues alone (using the scoring functions of the present invention), surface residues alone, or boundary residues 
alone. 

[0061] The classification of residue positions as core, surface or boundary may be done in several ways, as will be 
appreciated by those in the art. In a preferred embodiment, the classification is done via a visual scan of the original 
protein backbone structure, including the side chains, and assigning a classification based on a subjective evaluation 
of one skilled in the art of protein modelling. Alternatively, a preferred embodiment utilizes an assessment of the
orientation of the Cα-Cβ vectors relative to a solvent accessible surface computed using only the template Cα atoms. In
a preferred embodiment, the solvent accessible surface for only the Cα atoms of the target fold is generated using the
Connolly algorithm with a probe radius ranging from about 4 to about 12 Å, with from about 6 to about 10 Å being
preferred, and 8 Å being particularly preferred. The Cα radius used ranges from about 1.6 Å to about 2.3 Å, with from
about 1.8 to about 2.1 Å being preferred, and 1.95 Å being especially preferred. A residue is classified as a core position
if a) the distance for its Cα, along its Cα-Cβ vector, to the solvent accessible surface is greater than about 4-6 Å, with
greater than about 5.0 Å being especially preferred, and b) the distance for its Cβ to the nearest surface point is greater
than about 1.5-3 Å, with greater than about 2.0 Å being especially preferred. The remaining residues are classified as
surface positions if the sum of the distances from their Cα, along their Cα-Cβ vector, to the solvent accessible surface,
plus the distance from their Cβ to the closest surface point is less than about 2.5-4 Å, with less than about 2.7 Å
being especially preferred. All remaining residues are classified as boundary positions.
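
The distance-based classification rule can be written compactly as in the Python sketch below. It assumes that the two distances to the solvent accessible surface have already been computed by some other means (the Connolly surface calculation itself is not shown), and it uses the especially preferred cutoffs of 5.0 Å, 2.0 Å and 2.7 Å as defaults. The function and parameter names are hypothetical.

```python
def classify_position(ca_dist_along_vector, cb_dist_to_surface,
                      core_ca_cutoff=5.0, core_cb_cutoff=2.0,
                      surface_sum_cutoff=2.7):
    """Classify a residue position as 'core', 'surface' or 'boundary'.

    ca_dist_along_vector -- distance (in angstroms) from the C-alpha atom,
                            along the C-alpha/C-beta vector, to the surface
    cb_dist_to_surface   -- distance (in angstroms) from the C-beta atom to
                            the nearest surface point
    """
    if (ca_dist_along_vector > core_ca_cutoff
            and cb_dist_to_surface > core_cb_cutoff):
        return "core"
    if ca_dist_along_vector + cb_dist_to_surface < surface_sum_cutoff:
        return "surface"
    return "boundary"

# Made-up distances for three positions.
labels = [classify_position(6.0, 3.0),
          classify_position(1.0, 1.0),
          classify_position(4.0, 1.0)]
```

The cutoffs are keyword arguments so that other values within the stated ranges can be tried without changing the rule itself.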

[0062] Once each variable position is classified as either core, surface or boundary, a set of amino acid side chains, 
and thus a set of rotamers, is assigned to each position. That is, the set of possible amino acid side chains that the
program will allow to be considered at any particular position is chosen. Subsequently, once the possible amino acid
side chains are chosen, the set of rotamers that will be evaluated at a particular position can be determined. Thus, a
core residue will generally be selected from the group of hydrophobic residues consisting of alanine, valine, isoleucine,
leucine, phenylalanine, tyrosine, tryptophan, and methionine (in some embodiments, when the α scaling factor of the
van der Waals scoring function, described below, is low, methionine is removed from the set), and the rotamer set for
each core position potentially includes rotamers for these eight amino acid side chains (all the rotamers if a backbone
independent library is used, and subsets if a backbone dependent rotamer library is used). Similarly, surface positions are
generally selected from the group of hydrophilic residues consisting of alanine, serine, threonine, aspartic acid, aspar- 
agine, glutamine, glutamic acid, arginine, lysine and histidine. The rotamer set for each surface position thus includes 
rotamers for these ten residues. Finally, boundary positions are generally chosen from alanine, serine, threonine as- 
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EZZ t ^ ' 5 T*' 9,Utam ' C add - af9inine - ,ysine hisMin *- va " ne - isoleucine. leucine, phenylalanine 

tu SeVen,een res,dues ("»w»>«ng cysteine, glycine and proline are not used, although they can be) 
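As a small illustration of the residue groups just described, the sketch below stores the core and surface sets and takes the boundary set to be their union, which gives the seventeen residues mentioned above. The names `CORE`, `SURFACE`, `BOUNDARY` and `allowed_residues`, and the `low_alpha` flag, are hypothetical.

```python
# Three-letter codes for the residue groups named in the text.
CORE = {"ALA", "VAL", "ILE", "LEU", "PHE", "TYR", "TRP", "MET"}
SURFACE = {"ALA", "SER", "THR", "ASP", "ASN", "GLN", "GLU",
           "ARG", "LYS", "HIS"}
# Assumption: the boundary group is the union of the other two groups.
BOUNDARY = CORE | SURFACE


def allowed_residues(position_class, low_alpha=False):
    """Return the set of residues considered at a position of the given class.

    low_alpha: if True, methionine is dropped from the core set, as the
               text describes for low van der Waals scaling factors.
    """
    groups = {"core": CORE, "surface": SURFACE, "boundary": BOUNDARY}
    residues = set(groups[position_class])
    if position_class == "core" and low_alpha:
        residues.discard("MET")
    return residues
```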
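[0063] Thus, as will be appreciated by those in the art, there is a computational benefit to classifying the residue
positions, as it decreases the number of calculations. It should also be noted that there may be situations where the
sets of core, boundary and surface residues are altered from those described above; for example, under some circumstances,
one or more amino acids is either added or subtracted from the set of allowed amino acids. For example,
some proteins which dimerize or multimerize, or have ligand binding sites, may contain hydrophobic surface residues,
etc. In addition, residues that do not allow helix "capping" or the favorable interaction with an α-helix dipole may be
subtracted from a set of allowed residues.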

[0064] In a preferred embodiment, proline, cysteine and glycine are not included in the list of possible amino acid
side chains, and thus the rotamers for these side chains are not used. However, in a preferred embodiment, when the
variable residue position has a φ angle (that is, the dihedral angle defined by 1) the carbonyl carbon of the preceding
amino acid; 2) the nitrogen atom of the current residue; 3) the α-carbon of the current residue; and 4) the carbonyl
carbon of the current residue) greater than 0°, the position is set to glycine to minimize backbone strain.
[0065] Once the group of potential rotamers is assigned for each variable residue position, processing proceeds to
[…]. This processing step entails analyzing interactions of the rotamers with each other and with the
protein backbone to generate optimized protein sequences. The ranking module 34 may be used to perform these
operations. The processing initially comprises the use of a number of scoring functions, described below, to calculate
energies of interactions of the rotamers, either to the backbone itself or other rotamers.

[0066] The scoring functions include a Van der Waals potential scoring function, a hydrogen bond potential scoring
function, an atomic solvation scoring function, a secondary structure propensity scoring function and an electrostatic
scoring function. As is further described below, at least one scoring function is used to score each position, although
the scoring functions may differ depending on the position classification or other considerations, like favorable interaction
with an α-helix dipole. As outlined below, the total energy which is used in the calculations is the sum of the
energy of each scoring function used at a particular position, as is generally shown in Equation 1:

Equation 1

$$E_{total} = nE_{vdw} + nE_{as} + nE_{h\text{-}bonding} + nE_{ss} + nE_{elec}$$

[0067] In Equation 1, the total energy is the sum of the energy of the van der Waals potential (E_vdw), the energy of
atomic solvation (E_as), the energy of hydrogen bonding (E_h-bonding), the energy of secondary structure (E_ss) and the
energy of electrostatic interaction (E_elec). The term n is either 0 or 1, depending on whether the term is to be considered
for the particular residue position, as is more fully outlined below.
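The role of the switch n in Equation 1 can be sketched as follows. This is only an illustration: the term names in `TERMS` and the function `total_energy` are hypothetical labels for the five scoring functions listed above.

```python
# Hypothetical labels for the five scoring functions of Equation 1.
TERMS = ("vdw", "as", "h_bonding", "ss", "elec")


def total_energy(energies, used):
    """Sum the scoring-function energies whose switch n is 1.

    energies: dict mapping term name -> energy for this position.
    used:     dict mapping term name -> True (n = 1) or False (n = 0);
              terms missing from the dict are treated as n = 0.
    """
    return sum(energies[term] for term in TERMS if used.get(term, False))
```

For example, a position scored only with the van der Waals, solvation and hydrogen bond terms would pass `used = {"vdw": True, "as": True, "h_bonding": True}`.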
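[0068] In a preferred embodiment, a van der Waals' scoring function is used. As is known in the art, van der Waals'
forces are the weak, non-covalent and non-ionic forces between atoms and molecules, that is, the induced dipole and
electron repulsion (Pauli principle) forces.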

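[0069] The van der Waals scoring function is based on a van der Waals potential energy. There are a number of van
der Waals potential energy calculations, including a Lennard-Jones 12-6 potential with radii and well depth parameters
from the Dreiding force field, Mayo et al., J. Prot. Chem., 1990, expressly incorporated herein by reference, or the
exponential 6 potential. Equation 2, shown below, is the preferred Lennard-Jones potential:

Equation 2

$$E_{vdw} = D_0\left\{\left(\frac{R_0}{R}\right)^{12} - 2\left(\frac{R_0}{R}\right)^{6}\right\}$$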

[0070] R_0 is the geometric mean of the van der Waals radii of the two atoms under consideration, and D_0 is the
geometric mean of the well depth of the two atoms under consideration. E_vdw and R are the energy and interatomic
distance between the two atoms under consideration, as is more fully described below.
[0071] In a preferred embodiment, the van der Waals forces are scaled using a scaling factor, α, as is generally
discussed in Example 4. Equation 3 shows the use of α in the van der Waals Lennard-Jones potential equation:




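Equation 3

$$E_{vdw} = D_0\left\{\left(\frac{\alpha R_0}{R}\right)^{12} - 2\left(\frac{\alpha R_0}{R}\right)^{6}\right\}$$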

[0072] The role of the α scaling factor is to change the importance of packing effects in the optimization and design
of any particular protein. As discussed in the Examples, different values for α result in different sequences being generated
by the present methods. Specifically, a reduced van der Waals steric constraint can compensate for the restrictive
effect of a fixed backbone and discrete side-chain rotamers in the simulation and can allow a broader sampling of
sequences compatible with a desired fold. In a preferred embodiment, α values ranging from about 0.70 to about 1.10
can be used, with α values from about 0.8 to about 1.05 being preferred, and from about 0.85 to about 1.0 being
especially preferred. Specific α values which are preferred are 0.80, 0.85, 0.90, 0.95, 1.00, and 1.05.
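The scaled potential can be sketched as below. This is a minimal illustration, assuming the 12-6 Lennard-Jones form with α multiplying the radius term R_0; the function names `pair_parameters` and `vdw_energy` are hypothetical, and the radii and well depths passed in are placeholders rather than force field values.

```python
import math


def pair_parameters(radius_a, radius_b, depth_a, depth_b):
    """Geometric means of the two atoms' van der Waals radii and well depths."""
    return math.sqrt(radius_a * radius_b), math.sqrt(depth_a * depth_b)


def vdw_energy(r, r0, d0, alpha=1.0):
    """Assumed 12-6 Lennard-Jones energy with the radius scaled by alpha.

    r:     interatomic distance
    r0:    geometric mean of the two van der Waals radii
    d0:    geometric mean of the two well depths
    alpha: van der Waals scale factor (1.0 leaves the potential unscaled)
    """
    x = (alpha * r0) / r
    return d0 * (x ** 12 - 2.0 * x ** 6)
```

With this form the minimum sits at R = αR_0, so lowering α moves the repulsive wall inward and softens the penalty for close contacts.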
[0073] Generally speaking, variation of the van der Waals scale factor α results in four regimes of packing specificity:
regime 1 where 0.9 ≤ α ≤ 1.05 and packing constraints dominate the sequence selection; regime 2 where 0.8 ≤ α <
0.9 and the hydrophobic solvation potential begins to compete with packing forces; regime 3 where α < 0.8 and
hydrophobic solvation dominates the design; and regime 4 where α > 1.05 and van der Waals repulsions appear to be
too severe to allow meaningful sequence selection. In particular, different α values may be used for core, surface and
boundary positions, with regimes 1 and 2 being preferred for core residues, regime 1 being preferred for surface
residues, and regimes 1 and 2 being preferred for boundary residues.
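The four regimes can be expressed as a simple lookup. The function name `packing_regime` is hypothetical; the boundaries are the ones stated above.

```python
def packing_regime(alpha):
    """Return the packing-specificity regime (1-4) for a scale factor alpha."""
    if alpha > 1.05:
        return 4  # van der Waals repulsions too severe
    if alpha >= 0.9:
        return 1  # packing constraints dominate
    if alpha >= 0.8:
        return 2  # solvation begins to compete with packing
    return 3      # hydrophobic solvation dominates
```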

[0074] In a preferred embodiment, the van der Waals scaling factor is used in the total energy calculations for each 
variable residue position, including core, surface and boundary positions. 

[0075] In a preferred embodiment, an atomic solvation potential scoring function is used. As is appreciated by those
in the art, solvent interactions of a protein are a significant factor in protein stability, and residue/protein hydrophobicity
has been shown to be the major driving force in protein folding. Thus, there is an entropic cost to solvating hydrophobic
surfaces, in addition to the potential for misfolding or aggregation. Accordingly, the burial of hydrophobic surfaces within 
a protein structure is beneficial to both folding and stability. Similarly, there can be a disadvantage for burying hydrophilic 
residues. The accessible surface area of a protein atom is generally defined as the area of the surface over which a 
water molecule can be placed while making van der Waals contact with this atom and not penetrating any other protein 
atom. Thus, in a preferred embodiment, the solvation potential is generally scored by taking the total possible exposed 
surface area of the moiety or two independent moieties (either a rotamer or the first rotamer and the second rotamer), 
which is the reference, and subtracting out the "buried" area, i.e. the area which is not solvent exposed due to inter- 
actions either with the backbone or with other rotamers. This thus gives the exposed surface area. 
[0076] Alternatively, a preferred embodiment calculates the scoring function on the basis of the "buried" portion; i.e.
the total possible exposed surface area is calculated, and then the calculated surface area after the interaction of the 
moieties is subtracted, leaving the buried surface area. A particularly preferred method does both of these calculations. 
[0077] As is more fully described below, both of these methods can be done in a variety of ways. See Eisenberg et
al., Nature 319:199-203 (1986); Connolly, Science 221:709-713 (1983); and Wodak, et al., Proc. Natl. Acad. Sci. USA
77(4):1736-1740 (1980), all of which are expressly incorporated herein by reference. As will be appreciated by those
in the art, this solvation potential scoring function is conformation dependent, rather than conformation independent. 
[0078] In a preferred embodiment, the pairwise solvation potential is implemented in two components, "singles" 
(rotamer/template) and "doubles" (rotamer/rotamer), as is more fully described below. For the rotamer/template buried 
area, the reference state is defined as the rotamer in question at residue position i with the backbone atoms only of 
residues i-1, i and i+1, although in some instances just i may be used. Thus, in a preferred embodiment, the solvation
potential is not calculated for the interaction of each backbone atom with a particular rotamer, although more may be
done as required. The area of the side chain is calculated with the backbone atoms excluding solvent but not counted 
in the area. The folded state is defined as the area of the rotamer in question at residue i, but now in the context of the 
entire template structure including non-optimized side chains, i.e. every other fixed position residue. The rotamer/
template buried area is the difference between the reference and the folded states. The rotamer/rotamer reference 
area can be done in two ways; one by using simply the sum of the areas of the isolated rotamers; the second includes 
the full backbone. The folded state is the area of the two rotamers placed in their relative positions on the protein 
scaffold but with no template atoms present. In a preferred embodiment, the Richards definition of solvent accessible 
surface area (Lee and Richards, J. Mol. Biol. 55:379-400, 1971, hereby incorporated by reference) is used with a 
probe radius ranging from 0.8 to 1.6 Å, with 1.4 Å being preferred, and Dreiding van der Waals radii, scaled from 0.8
to 1.0. Carbon and sulfur, and all attached hydrogens, are considered nonpolar. Nitrogen and oxygen, and all attached
hydrogens, are considered polar. Surface areas are calculated with the Connolly algorithm using a dot density of 10
Å⁻² (Connolly, (1983) (supra), hereby incorporated by reference).

[0079] In a preferred embodiment, there is a correction for a possible overestimate of buried surface area which
may exist in the calculation of the energy of interaction between two rotamers (but not the interaction of a rotamer
with the backbone). Since, as is generally outlined below, rotamers are only considered in pairs, that is, a first rotamer is
only compared to a second rotamer during the "doubles" calculations, this may overestimate the amount of buried
surface area in locations where more than two rotamers interact, that is, where rotamers from three or more residue
positions come together. Thus, a correction or scaling factor is used as outlined below.
[0080] The general energy of solvation is shown in Equation 4:

Equation 4

$$E_{sa} = f(SA)$$

where E_sa is the energy of solvation, f is a constant used to correlate surface area and energy, and SA is the surface
area. This equation can be broken down, depending on which parameter is being evaluated. Thus, when the hydrophobic
buried surface area is used, Equation 5 is appropriate:

Equation 5

$$E_{sa} = f_1(SA_{buried\ hydrophobic})$$

where f_1 is a constant which ranges from about 10 to about 50 cal/mol/Å², with 23 or 26 cal/mol/Å² being preferred.
When a penalty for hydrophilic burial is being considered, the equation is shown in Equation 6:

Equation 6

$$E_{sa} = f_1(SA_{buried\ hydrophobic}) + f_2(SA_{buried\ hydrophilic})$$

where f_2 is the constant applied to the buried hydrophilic surface area […]. Similarly, if
a penalty for hydrophobic exposure is used, equation 7 or 8 may be used:

Equation 7

$$E_{sa} = f_1(SA_{buried\ hydrophobic}) + f_3(SA_{exposed\ hydrophobic})$$

Equation 8

$$E_{sa} = f_1(SA_{buried\ hydrophobic}) + f_2(SA_{buried\ hydrophilic}) + f_3(SA_{exposed\ hydrophobic}) + f_4(SA_{exposed\ hydrophilic})$$

[0081] In a preferred embodiment, f_3 = -f_1.
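The linear forms of Equations 5 through 8 can be evaluated with one function in which unused coefficients are left at zero. This is a sketch only: the name `solvation_energy` is hypothetical, areas are in Å² and coefficients in cal/mol/Å², and the default f_1 of 26 is one of the preferred values quoted above. The sketch simply evaluates the sum and makes no claim about the sign convention of the resulting score.

```python
def solvation_energy(buried_hydrophobic, buried_hydrophilic=0.0,
                     exposed_hydrophobic=0.0, exposed_hydrophilic=0.0,
                     f1=26.0, f2=0.0, f3=0.0, f4=0.0):
    """Evaluate the surface-area solvation score (cal/mol).

    With f2 = f3 = f4 = 0 this reduces to Equation 5; setting f2 gives
    Equation 6, setting f3 gives Equation 7, and setting all of them
    gives Equation 8.
    """
    return (f1 * buried_hydrophobic
            + f2 * buried_hydrophilic
            + f3 * exposed_hydrophobic
            + f4 * exposed_hydrophilic)
```

The preferred relation f_3 = -f_1 is obtained by passing `f3=-f1` explicitly.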

[0083] In a preferred embodiment, this overcounting problem is addressed using a scaling factor […].
[0086] In a preferred embodiment, the hydrogen bond potential consists of a distance-dependent term and an angle-
dependent term, as shown in Equation 9:

Equation 9

$$E_{H\text{-}Bonding} = D_0\left\{5\left(\frac{R_0}{R}\right)^{12} - 6\left(\frac{R_0}{R}\right)^{10}\right\}F(\theta, \phi, \varphi)$$

where R_0 (2.8 Å) and D_0 (8 kcal/mol) are the hydrogen-bond equilibrium distance and well-depth, respectively, and R
is the donor to acceptor distance. This hydrogen bond potential is based on the potential used in DREIDING with more
restrictive angle-dependent terms to limit the occurrence of unfavorable hydrogen bond geometries. The angle term
varies depending on the hybridization state of the donor and acceptor, as shown in Equations 10, 11, 12 and 13.
Equation 10 is used for sp³ donor to sp³ acceptor; Equation 11 is used for sp³ donor to sp² acceptor; Equation 12 is
used for sp² donor to sp³ acceptor; and Equation 13 is used for sp² donor to sp² acceptor.

Equation 10

$$F = \cos^2\theta \, \cos^2(\phi - 109.5)$$

Equation 11

$$F = \cos^2\theta \, \cos^2\phi$$

Equation 12

$$F = \cos^4\theta$$

Equation 13

$$F = \cos^2\theta \, \cos^2(\max[\phi, \varphi])$$

[0087] In Equations 10-13, $\theta$ is the donor-hydrogen-acceptor angle, $\phi$ is the hydrogen-acceptor-base angle (the base
is the atom attached to the acceptor, for example the carbonyl carbon is the base for a carbonyl oxygen acceptor), and
$\varphi$ is the angle between the normals of the planes defined by the six atoms attached to the sp² centers (the supplement
of $\varphi$ is used when $\varphi$ is less than 90°). The hydrogen-bond function is only evaluated when 2.6 Å ≤ R ≤ 3.2 Å, $\theta$ > 90°,
$\phi$ - 109.5° < 90° for the sp³ donor - sp³ acceptor case, and $\phi$ > 90° for the sp³ donor - sp² acceptor case; preferably,
no switching functions are used. Template donors and acceptors that are involved in template-template hydrogen bonds
are preferably not included in the donor and acceptor lists. For the purpose of exclusion, a template-template hydrogen
bond is considered to exist when 2.5 Å ≤ R ≤ 3.3 Å and $\theta$ ≥ 135°.
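The angle terms of Equations 10-13 and the evaluation window can be sketched as follows. Two points are assumptions rather than statements from the text: the distance term is taken to be the 12-10 form used in DREIDING, which the text names as the basis of the potential, and the upper distance bound is treated as inclusive. The function names `angle_term` and `hbond_energy` are hypothetical, and all angles are in degrees.

```python
import math


def angle_term(theta, phi, varphi, donor, acceptor):
    """Angle-dependent factor F for the four hybridization cases."""
    cos2_theta = math.cos(math.radians(theta)) ** 2
    if donor == "sp3" and acceptor == "sp3":      # Equation 10
        return cos2_theta * math.cos(math.radians(phi - 109.5)) ** 2
    if donor == "sp3" and acceptor == "sp2":      # Equation 11
        return cos2_theta * math.cos(math.radians(phi)) ** 2
    if donor == "sp2" and acceptor == "sp3":      # Equation 12
        return cos2_theta ** 2
    # sp2 donor to sp2 acceptor                   # Equation 13
    return cos2_theta * math.cos(math.radians(max(phi, varphi))) ** 2


def hbond_energy(r, theta, phi=0.0, varphi=0.0,
                 donor="sp2", acceptor="sp3", r0=2.8, d0=8.0):
    """Hydrogen bond energy (kcal/mol); zero outside the evaluation window."""
    if not (2.6 <= r <= 3.2) or theta <= 90.0:
        return 0.0
    if donor == "sp3" and acceptor == "sp3" and not (phi - 109.5 < 90.0):
        return 0.0
    if donor == "sp3" and acceptor == "sp2" and not (phi > 90.0):
        return 0.0
    x = r0 / r
    # Assumed DREIDING-style 12-10 distance dependence.
    distance_term = d0 * (5.0 * x ** 12 - 6.0 * x ** 10)
    return distance_term * angle_term(theta, phi, varphi, donor, acceptor)
```

At the equilibrium distance with ideal angles the sketch returns -D_0, i.e. -8 kcal/mol.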

[0088] The hydrogen-bond potential may also be combined or used with a weak coulombic term that includes a
distance-dependent dielectric constant of 40R, where R is the interatomic distance. Partial atomic charges are preferably
only applied to polar functional groups. A net formal charge of +1 is used for Arg and Lys and a net formal charge
of -1 is used for Asp and Glu; see Gasteiger, et al., Tetrahedron 36:3219-3288 (1980); Rappé, et al., J. Phys. Chem.
95:3358-3363 (1991).
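A coulombic term with a distance-dependent dielectric of 40R can be sketched as below. The conversion constant of about 332.06 kcal·Å/(mol·e²) is an assumption introduced here so that charges in units of e and distances in Å give kcal/mol; it does not appear in the text, and the function name `coulomb_energy` is hypothetical.

```python
# Assumed unit conversion (kcal*Angstrom/(mol*e^2)); not given in the text.
COULOMB_CONSTANT = 332.0637


def coulomb_energy(q1, q2, r, dielectric_slope=40.0):
    """Coulomb energy (kcal/mol) with dielectric constant = slope * r.

    q1, q2: charges in units of the elementary charge
    r:      interatomic distance in Angstroms
    """
    epsilon = dielectric_slope * r
    return COULOMB_CONSTANT * q1 * q2 / (epsilon * r)
```

Because the dielectric grows with distance, the energy falls off as 1/R² rather than 1/R, which keeps the term weak at long range.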

[0089] In a preferred embodiment, an explicit penalty is given for buried polar hydrogen atoms which are not hydrogen
bonded to another atom. See Eisenberg, et al., (1986) (supra), hereby expressly incorporated by reference. In a preferred
embodiment, this penalty for polar hydrogen burial is from about 0 to about 3 kcal/mol, with from about 1 to
about 3 being preferred and 2 kcal/mol being particularly preferred. This penalty is only applied to buried polar hydrogens
not involved in hydrogen bonds. A hydrogen bond is considered to exist when E_HB ranges from about 1 to about
4 kcal/mol, with E_HB of less than -2 kcal/mol being preferred. In addition, in a preferred embodiment, the penalty is not
applied to template hydrogens, i.e. unpaired buried hydrogens of the backbone.
[0090] In a preferred embodiment, only hydrogen bonds between a first rotamer and the backbone are scored, and
rotamer-rotamer hydrogen bonds are not scored. In an alternative embodiment, hydrogen bonds between a first rotamer
and the backbone are scored, and rotamer-rotamer hydrogen bonds are scaled by 0.5.

[0091] In a preferred embodiment, the hydrogen bonding scoring function is used for all positions, including core,
surface and boundary positions. In alternate embodiments, the hydrogen bonding scoring function may be used on
only one or two of these.

[0092] In a preferred embodiment, a secondary structure propensity scoring function is used. This is based on the
specific amino acid side chain, and is conformation independent. That is, each amino acid has a certain propensity to
take on a secondary structure, either α-helix or β-sheet, based on its φ and ψ angles. See Muñoz et al., Current Op. in
Biotechnol. 6:382 (1995); Minor, et al., Nature 367:660-663 (1994); Padmanabhan, et al., […];
Muñoz, et al., Folding & Design 1(3):167-178 (1996); and Chakrabartty, et al., Protein Sci. 3:843 (1994), all of which
are expressly incorporated herein by reference. Thus, for variable residue positions that are in recognizable secondary
structure in the backbone, a secondary structure propensity scoring function is preferably used. That is, when a variable
residue position is in an α-helical area of the backbone, the α-helical propensity scoring function described below is
calculated. Whether or not a position is in an α-helical area of the backbone is determined as will be appreciated by
those in the art […] -100 generally describe an α-helical area of the backbone.

[0093] Similarly, when a variable residue position is in a β-sheet backbone conformation, the β-sheet propensity
scoring function is used. β-sheet backbone conformation is generally described by φ angles from […].

[0094] In a preferred embodiment, energies associated with secondary propensities are calculated using Equation
14:

Equation 14

$$E_{\alpha} = N_{ss}(\Delta G^{\circ}_{aa} - \Delta G^{\circ}_{gly})$$

[0095] In Equation 14, E_α (or E_β) is the energy of α-helical propensity, ΔG°_aa is the standard free energy of helix
propagation of the amino acid, and ΔG°_gly is the standard free energy of helix propagation of glycine used as a standard,
or standard free energy of β-sheet formation of the amino acid, both of which are available in the literature (see
Chakrabartty, et al., (1994) (supra), and Muñoz, et al., Folding & Design 1(3):167-178 (1996), both of which are expressly
incorporated herein by reference), and N_ss is the propensity scale factor which is set to range from 1 to 4 […].

[0097] In a preferred embodiment, the secondary structure propensity scoring function is used only in the energy
calculations for surface variable residue positions. In alternate embodiments, the secondary structure propensity scoring
function is used in the calculations for core and boundary regions as well.

[0098] In a preferred embodiment, an electrostatic scoring function is used, as shown below in Equation 15: 

Equation 15 

[…] three or four scoring functions are used for each variable residue position.
[0101] Once the scoring functions to be used are identified for each variable position, the preferred first step in the
computational analysis comprises the determination of the interaction of each possible rotamer with all or part of the
remainder of the protein. That is, the energy of interaction, as measured by one or more of the scoring functions, of
each possible rotamer at each variable residue position with either the backbone or other rotamers is calculated. In
a preferred embodiment, the interaction of each rotamer with the entire remainder of the protein, i.e. both the entire
template and all other rotamers, is done. However, as outlined above, it is possible to only model a portion of a protein,
for example a domain of a larger protein, and thus in some cases, not all of the protein need be considered.
[0102] In a preferred embodiment, the first step of the computational processing is done by calculating two sets of
interactions for each rotamer at every position (step 70 of figure 3): the interaction of the rotamer side chain with the
template or backbone (the "singles" energy), and the interaction of the rotamer side chain with all other possible
rotamers at every other position (the "doubles" energy), whether that position is varied or floated. It should be understood
that the backbone in this case includes both the atoms of the protein structure backbone, as well as the atoms of any
fixed residues, wherein the fixed residues are defined as a particular conformation of an amino acid.
[0103] Thus, "singles" (rotamer/template) energies are calculated for the interaction of every possible rotamer at
every variable residue position with the backbone, using some or all of the scoring functions. Thus, for the hydrogen
bonding scoring function, every hydrogen bonding atom of the rotamer and every hydrogen bonding atom of the backbone
is evaluated, and the E_HB is calculated for each possible rotamer at every variable position. Similarly, for the van
der Waals scoring function, every atom of the rotamer is compared to every atom of the template (generally excluding
the backbone atoms of its own residue), and the E_vdw is calculated for each possible rotamer at every variable residue
position. In addition, generally no van der Waals energy is calculated if the atoms are connected by three bonds or
less. For the atomic solvation scoring function, the surface of the rotamer is measured against the surface of the
template, and the E_as for each possible rotamer at every variable residue position is calculated. The secondary structure
propensity scoring function is also considered as a singles energy, and thus the total singles energy may contain an
E_ss term. As will be appreciated by those in the art, many of these energy terms will be close to zero, depending on
the physical distance between the rotamer and the template position; that is, the farther apart the two moieties, the
lower the energy.

[0104] Accordingly, as outlined above, the total singles energy is the sum of the energy of each scoring function used
at a particular position, as shown in Equation 1, wherein n is either 1 or zero, depending on whether that particular
scoring function was used at the rotamer position:

Equation 1

$$E_{total} = nE_{vdw} + nE_{as} + nE_{h\text{-}bonding} + nE_{ss} + nE_{elec}$$

[0105] Once calculated, each singles E_total for each possible rotamer is stored in the memory 24 within the computer,
such that it may be used in subsequent calculations, as outlined below.

[0106] For the calculation of "doubles" energy (rotamer/rotamer), the interaction energy of each possible rotamer is
compared with every possible rotamer at all other variable residue positions. Thus, "doubles" energies are calculated
for the interaction of every possible rotamer at every variable residue position with every possible rotamer at every
other variable residue position, using some or all of the scoring functions. Thus, for the hydrogen bonding scoring
function, every hydrogen bonding atom of the first rotamer and every hydrogen bonding atom of every possible second
rotamer is evaluated, and the E_HB is calculated for each possible rotamer pair for any two variable positions. Similarly,
for the van der Waals scoring function, every atom of the first rotamer is compared to every atom of every possible
second rotamer, and the E_vdw is calculated for each possible rotamer pair at every two variable residue positions. For
the atomic solvation scoring function, the surface of the first rotamer is measured against the surface of every possible
second rotamer, and the E_as for each possible rotamer pair at every two variable residue positions is calculated. The
secondary structure propensity scoring function need not be run as a "doubles" energy, as it is considered as a component
of the "singles" energy. As will be appreciated by those in the art, many of these double energy terms will be
close to zero, depending on the physical distance between the first rotamer and the second rotamer; that is, the farther
apart the two moieties, the lower the energy.

[0107] Accordingly, as outlined above, the total doubles energy is the sum of the energy of each scoring function
used to evaluate every possible pair of rotamers, as shown in Equation 16, wherein n is either 1 or zero, depending
on whether that particular scoring function was used at the rotamer position:

Equation 16

$$E_{total} = nE_{vdw} + nE_{as} + nE_{h\text{-}bonding} + nE_{elec}$$

[0108] An example is illuminating. A first variable position, i, has three (an unrealistically low number) possible
rotamers (which may be either from a single amino acid or different amino acids) which are labelled ia, ib, and ic. A
second variable position, j, also has three possible rotamers, labelled jd, je, and jf. Thus, nine doubles energies (E_total)
are calculated in all: E_total(ia, jd), E_total(ia, je), E_total(ia, jf), E_total(ib, jd), E_total(ib, je), E_total(ib, jf),
E_total(ic, jd), E_total(ic, je), and E_total(ic, jf).
[0109] Once calculated, each doubles E_total for each possible rotamer pair is stored in the memory 24 within the
computer, such that it may be used in subsequent calculations, as outlined below.

[0110] Once the singles and doubles energies are calculated and stored, the next step of the computational processing
may occur. Generally speaking, the goal of the computational processing is to determine a set of optimized protein
sequences. By "optimized protein sequence" herein is meant a sequence that best fits the mathematical equations
herein. As will be appreciated by those in the art, a global optimized sequence is the one sequence that best fits
Equation 1, i.e. the sequence that has the lowest energy of any possible sequence. However, there are any number
of sequences that are not the global minimum but that have low energies.
[0111] In a preferred embodiment, the set comprises the globally optimal sequence in its optimal conformation, i.e.
the optimum rotamer at each variable position. That is, computational processing is run until the simulation program
converges on a single sequence which is the global optimum.

[0112] In a preferred embodiment, the set comprises at least two optimized protein sequences. Thus, for example,
the computational processing step may eliminate a number of disfavored combinations but be stopped prior to convergence,
providing a set of sequences of which the global optimum is one. In addition, further computational analysis,
for example using a different method, may be run on the set, to further eliminate sequences or rank them differently.
Alternatively, as is more fully described below, the global optimum may be reached, and then further computational
processing may occur, which generates additional optimized sequences in the neighborhood of the global optimum.

[0113] If a set comprising more than one optimized protein sequence is generated, they may be rank ordered in
terms of theoretical quantitative stability, as is more fully described below.

[0114] In a preferred embodiment, the computational processing step first comprises an elimination step, sometimes
referred to as "applying a cutoff", either a singles elimination or a doubles elimination. Singles elimination
comprises the elimination of all rotamers with template interaction energies of greater than about 10 kcal/mol prior to
any computation, with elimination energies of greater than about 15 kcal/mol being preferred and greater than about 25 kcal/
mol being especially preferred. Similarly, doubles elimination is done when a rotamer has interaction energies greater
than about 10 kcal/mol with all rotamers at a second residue position, with energies greater than about 15 being preferred
and greater than about 25 kcal/mol being especially preferred.
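The two cutoff steps can be sketched as simple filters over precomputed energies. The function names and the dictionary layouts are hypothetical; the default cutoff of 25 kcal/mol is the especially preferred value quoted above.

```python
def apply_singles_cutoff(singles, cutoff=25.0):
    """Drop rotamers whose rotamer/template energy exceeds the cutoff.

    singles: {position: {rotamer: template interaction energy}}
    """
    return {
        position: {rot: e for rot, e in rotamers.items() if e <= cutoff}
        for position, rotamers in singles.items()
    }


def apply_doubles_cutoff(rotamers_i, rotamers_j, pair_energy, cutoff=25.0):
    """Keep rotamers at position i unless they clash with every rotamer at j.

    pair_energy: {(rotamer_at_i, rotamer_at_j): interaction energy}
    """
    return [
        a for a in rotamers_i
        if not all(pair_energy[(a, b)] > cutoff for b in rotamers_j)
    ]
```

A rotamer survives the doubles filter as long as at least one rotamer at the second position interacts with it at or below the cutoff.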
[0115] In a preferred embodiment, the computational processing comprises direct determination of total sequence
energies, followed by comparison of the total sequence energies to ascertain the global optimum and rank order the
other possible sequences, if desired. The energy of a total sequence is shown below in Equation 17:

Equation 17

$$E_{total\ protein} = E_{(b-b)} + \sum_{all\ i} E_{(ia)} + \sum_{all\ i,j\ pairs} E_{(ia,ja)}$$

Thus every possible combination of rotamers may be directly evaluated by adding the backbone-backbone (sometimes
referred to herein as template-template) energy (E_(b-b), which is constant over all sequences herein since the backbone
is kept constant), the singles energy for each rotamer (which has already been calculated and stored), and the doubles
energy for each rotamer pair (which has already been calculated and stored). Each total sequence energy of each
possible rotamer sequence can then be ranked, either from best to worst or worst to best. This is obviously computationally
expensive and becomes unwieldy as the length of the protein increases.
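The direct evaluation described above amounts to enumerating every combination of rotamers and summing stored energies. The sketch below illustrates this under assumed data layouts: `singles` is keyed by (position, rotamer), `doubles` by a pair of such keys with the lower-sorted position first, and the names `total_sequence_energy` and `rank_all` are hypothetical.

```python
from itertools import product


def total_sequence_energy(assignment, singles, doubles, e_bb=0.0):
    """Template energy plus singles plus doubles for one rotamer assignment.

    assignment: {position: chosen rotamer}
    """
    positions = sorted(assignment)
    energy = e_bb + sum(singles[(p, assignment[p])] for p in positions)
    for index, i in enumerate(positions):
        for j in positions[index + 1:]:
            energy += doubles[((i, assignment[i]), (j, assignment[j]))]
    return energy


def rank_all(rotamers, singles, doubles, e_bb=0.0):
    """Score every combination of rotamers and sort from best to worst.

    rotamers: {position: list of candidate rotamers}
    """
    positions = sorted(rotamers)
    scored = []
    for combo in product(*(rotamers[p] for p in positions)):
        assignment = dict(zip(positions, combo))
        energy = total_sequence_energy(assignment, singles, doubles, e_bb)
        scored.append((energy, combo))
    return sorted(scored)
```

The number of combinations is the product of the rotamer counts at each position, which is why this approach becomes unwieldy for anything beyond a handful of positions.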

[0116] In a preferred embodiment, the computational processing includes one or more Dead End Elimination (DEE)
computational steps. The DEE theorem is the basis for a very fast discrete search program that was designed to pack
protein side chains on a fixed backbone with a known sequence. See Desmet, et al., Nature 356:539-542 (1992);
[…]; Goldstein, Biophys. Jour. 66:1335-1340 (1994), all of which are incorporated herein by reference. DEE is based on
the observation that if a rotamer can be eliminated from consideration at a particular position, i.e. make a determination
that a particular rotamer is definitely not part of the global optimal conformation, the size of the search is reduced.
This is done by comparing the worst interaction (i.e. energy or E_total) of a first rotamer at a single variable position
with the best interaction of a second rotamer at the same variable position. If the worst interaction of the first rotamer
is still better than the best interaction of the second rotamer, then the second rotamer cannot possibly be in the optimal
conformation of the sequence. The original DEE theorem is shown in Equation 18:

Equation 18

$$E(ia) + \sum_j \left[\min_t \{E(ia, jt)\}\right] > E(ib) + \sum_j \left[\max_t \{E(ib, jt)\}\right]$$

[0117] In Equation 18, rotamer ia is being compared to rotamer ib. The left side of the inequality is the best possible
interaction energy (E_total) of ia with the rest of the protein; that is, "min over t" means find the rotamer t on position j
that has the best interaction with rotamer ia. Similarly, the right side of the inequality is the worst possible (max)
interaction energy of rotamer ib with the rest of the protein. If this inequality is true, then rotamer ia is Dead-Ending and
can be Eliminated. The speed of DEE comes from the fact that the theorem only requires sums over the sequence
length to test and eliminate rotamers.
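The original DEE test can be sketched as a direct transcription of the inequality: the best case for rotamer ia is compared with the worst case for rotamer ib. The data layout is an assumption made for illustration: `singles` is keyed by (position, rotamer), `doubles` by (position, rotamer, position, rotamer) in either order, and the function names are hypothetical.

```python
def pair_energy(doubles, i, a, j, t):
    """Look up a doubles energy stored under either ordering of the pair."""
    if (i, a, j, t) in doubles:
        return doubles[(i, a, j, t)]
    return doubles[(j, t, i, a)]


def dee_eliminates(i, a, b, rotamers, singles, doubles):
    """Return True if rotamer a at position i is dead-ending relative to b.

    rotamers: {position: list of rotamers still under consideration}
    """
    best_a = singles[(i, a)]    # left side: best case for a
    worst_b = singles[(i, b)]   # right side: worst case for b
    for j, candidates in rotamers.items():
        if j == i:
            continue
        best_a += min(pair_energy(doubles, i, a, j, t) for t in candidates)
        worst_b += max(pair_energy(doubles, i, b, j, t) for t in candidates)
    return best_a > worst_b
```

Each test costs one pass over the other positions, which is the source of the speed noted above.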

[0118] In a preferred embodiment, a variation of DEE is performed. Goldstein DEE, based on Goldstein, (1994)
(supra), hereby expressly incorporated by reference, is a variation of the DEE computation, as shown in Equation 19:

Equation 19

$$E(ia) - E(ib) + \sum_j \left[\min_t \{E(ia, jt) - E(ib, jt)\}\right] > 0$$

[0119] In essence, the Goldstein Equation 19 says that a first rotamer a of a particular position i (rotamer ia) will not
contribute to a local energy minimum if the energy of conformation with ia can always be lowered by just changing the
rotamer at that position to ib, keeping the other residues equal. If this inequality is true, then rotamer ia is Dead-Ending
and can be Eliminated.
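The Goldstein criterion differs from the original test in that the minimum is taken over the energy difference at each position rather than over each rotamer separately. A minimal sketch follows, using the same assumed data layout as an illustration: `singles` keyed by (position, rotamer), `doubles` keyed by (position, rotamer, position, rotamer) in either order, with hypothetical function names.

```python
def pair_energy(doubles, i, a, j, t):
    """Look up a doubles energy stored under either ordering of the pair."""
    if (i, a, j, t) in doubles:
        return doubles[(i, a, j, t)]
    return doubles[(j, t, i, a)]


def goldstein_eliminates(i, a, b, rotamers, singles, doubles):
    """Return True if switching a -> b at position i always lowers the energy."""
    total = singles[(i, a)] - singles[(i, b)]
    for j, candidates in rotamers.items():
        if j == i:
            continue
        # Smallest advantage of b over a across the rotamers at position j.
        total += min(
            pair_energy(doubles, i, a, j, t) - pair_energy(doubles, i, b, j, t)
            for t in candidates
        )
    return total > 0
```

Because the difference is minimized term by term, this test can eliminate rotamers that the original inequality leaves in place, as in a case where both rotamers interact badly with the same neighbor.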

[0120] Thus, in a preferred embodiment, a first DEE computation is done where rotamers at a single variable position
are compared ("singles" DEE) to eliminate rotamers at a single position. This analysis is repeated for every variable
position, to eliminate as many single rotamers as possible. In addition, every time a rotamer is eliminated from consideration
through DEE, the minimum and maximum calculations of Equation 18 or 19 change, depending on which
DEE variation is used, thus conceivably allowing the elimination of further rotamers. Accordingly, the singles DEE
computation can be repeated until no more rotamers can be eliminated; that is, when the inequality is no longer true,
such that all of the remaining rotamers could conceivably be found on the global optimum.

[0121] In a preferred embodiment, "doubles" DEE is additionally done. In doubles DEE, pairs of rotamers are eval- 
uated; that is, a first rotamer at a first position and a second rotamer at a second position are compared to a third 
rotamer at the first position and a fourth rotamer at the second position, either using original or Goldstein DEE. Pairs 
are then flagged as nonallowable, although single rotamers cannot be eliminated, only the pair. Again, as for singles 
DEE, every time a rotamer pair is flagged as nonallowable, the minimum calculations of Equation 18 or 19 change 
(depending on which DEE variation is used) thus conceivably allowing the flagging of further rotamer pairs. Accordingly, 
the doubles DEE computation can be repeated until no more rotamer pairs can be flagged; that is, where the energy 
of rotamer pairs overlap such that all of them could conceivably be found on the global optimum. 
[0122] In addition, in a preferred embodiment, rotamer pairs are initially prescreened to eliminate rotamer pairs prior
to DEE. This is done by doing relatively computationally inexpensive calculations to eliminate certain pairs up front. 
This may be done in several ways, as is outlined below. 

[0123] In a preferred embodiment, the rotamer pair with the lowest interaction energy with the rest of the system is
found. Inspection of the energy distributions in sample matrices has revealed that an [i_u j_v] pair that dead-end eliminates
a particular [i_r j_s] pair can also eliminate other [i_r j_s] pairs. In fact, there are often a few [i_u j_v] pairs, which we call "magic
bullets," that eliminate a significant number of [i_r j_s] pairs. We have found that one of the most potent magic bullets is the
pair for which maximum interaction energy, ε_max([i_u j_v]), is least. This pair is referred to as [i_u j_v]_mb. If this rotamer pair
is used in the first round of doubles DEE, it tends to eliminate pairs faster.

[0124] Our first speed enhancement is to evaluate the first-order doubles calculation for only the matrix elements in
the row corresponding to the [i_u j_v]_mb pair. The discovery of [i_u j_v]_mb is an n² calculation (n = the number of rotamers per
position), and the application of Equation 19 to the single row of the matrix corresponding to this rotamer pair is another
n² calculation, so the calculation time is small in comparison to a full first-order doubles calculation. In practice, this
calculation produces a large number of dead-ending pairs, often enough to proceed to the next iteration of singles
elimination without any further searching of the doubles matrix.

[0125] The magic bullet first-order calculation will also discover all dead-ending pairs that would be discovered by
Equation 18 or 19, thereby making it unnecessary. This stems from the fact that $\varepsilon_{\max}([i_u j_v]_{mb})$ must be less than or
equal to any $\varepsilon_{\max}([i_u j_v])$ that would successfully eliminate a pair by Equation 18 or 19.
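
The magic bullet idea can be illustrated with a minimal sketch. It assumes the pair extrema are already precalculated into dictionaries, and it uses a plain extrema comparison as a stand-in for Equations 18 and 19, which are not reproduced in this passage. The names `find_magic_bullet` and `flag_with_bullet` and the toy energies are hypothetical.

```python
def find_magic_bullet(eps_max):
    """Return the pair key whose maximum interaction energy is the least."""
    return min(eps_max, key=eps_max.get)


def flag_with_bullet(eps_min, eps_max, bullet):
    """Flag every pair whose best-case energy exceeds the bullet's worst case."""
    worst_bullet = eps_max[bullet]
    return sorted(pair for pair, best in eps_min.items()
                  if pair != bullet and best > worst_bullet)


# Toy extrema (kcal/mol) for rotamer pairs (r, s) at one pair of positions i, j.
# These numbers are invented for illustration only.
eps_min = {(0, 0): -5.0, (0, 1): 2.0, (1, 0): 7.5, (1, 1): 9.0}
eps_max = {(0, 0): 3.0, (0, 1): 6.0, (1, 0): 12.0, (1, 1): 15.0}

bullet = find_magic_bullet(eps_max)           # single scan over the pairs
flagged = flag_with_bullet(eps_min, eps_max, bullet)  # single row of the matrix
```

Both steps scan the pair list once, which mirrors the two $n^2$ calculations described above rather than a full comparison of every pair against every other pair.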

[0126] Since the minima and maxima of any given pair have been precalculated as outlined herein, a second speed
enhancement [...]

Equation 20

$$\varepsilon_{\min}([i_r j_s]) < \varepsilon_{\min}([i_u j_v])$$

Equation 21

$$\varepsilon_{\max}([i_r j_s]) < \varepsilon_{\max}([i_u j_v])$$

[...] maxima were very large, the quantities were also compared logarithmically.

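[0130] Most of the combinations were able to predict dead-ending matrix elements [...] metrics were the fractional interval overlap [...]

Equation 22

$$q_{rs} = \frac{\text{interval overlap}}{\text{interval}([i_r j_s])} = \frac{\varepsilon_{\max}([i_u j_v]) - \varepsilon_{\min}([i_r j_s])}{\varepsilon_{\max}([i_r j_s]) - \varepsilon_{\min}([i_r j_s])}$$

Equation 23

$$q_{uv} = \frac{\text{interval overlap}}{\text{interval}([i_u j_v])} = \frac{\varepsilon_{\max}([i_u j_v]) - \varepsilon_{\min}([i_r j_s])}{\varepsilon_{\max}([i_u j_v]) - \varepsilon_{\min}([i_u j_v])}$$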

[0131] These values are calculated using the minima and maxima equations 24, 25, 26 and 27 (see Figure 14):

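Equation 24

$$\varepsilon_{\max}([i_r j_s]) = \varepsilon([i_r j_s]) + \sum_{k \neq i,j} \max_t \varepsilon([i_r j_s], k_t)$$

Equation 25

$$\varepsilon_{\min}([i_r j_s]) = \varepsilon([i_r j_s]) + \sum_{k \neq i,j} \min_t \varepsilon([i_r j_s], k_t)$$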
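Equation 26

$$\varepsilon_{\max}([i_u j_v]) = \varepsilon([i_u j_v]) + \sum_{k \neq i,j} \max_t \varepsilon([i_u j_v], k_t)$$

Equation 27

$$\varepsilon_{\min}([i_u j_v]) = \varepsilon([i_u j_v]) + \sum_{k \neq i,j} \min_t \varepsilon([i_u j_v], k_t)$$
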

[0132] These metrics were selected because they yield ratios of the occurrence of dead-ending matrix elements to 
the total occurrence of elements that are higher than any of the other metrics tested. For example, there are very few 
matrix elements (2%) for which $q_{rs} > 0.98$, yet these elements produce 30-40% of all of the dead-ending pairs.
[0133] Accordingly, the first-order doubles criterion is applied only to those doubles for which $q_{rs} > 0.98$ and $q_{uv} > 0.99$.
The sample data analyses predict that by using these two metrics, as many as half of the dead-ending elements
may be found by evaluating only two to five percent of the reduced matrix. 
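
A minimal sketch of this prescreen follows. It assumes that each metric divides the interval overlap, taken as the maximum of the eliminating pair minus the minimum of the candidate pair, by the energy interval of the corresponding pair; the denominators are partly illegible in the source, so that reading is an assumption. The function names and sample numbers are hypothetical.

```python
def overlap_metrics(min_rs, max_rs, min_uv, max_uv):
    """Return (q_rs, q_uv) for a candidate pair rs and an eliminating pair uv.

    Assumes both energy intervals have nonzero width.
    """
    overlap = max_uv - min_rs
    q_rs = overlap / (max_rs - min_rs)
    q_uv = overlap / (max_uv - min_uv)
    return q_rs, q_uv


def worth_testing(min_rs, max_rs, min_uv, max_uv, q_rs_cut=0.98, q_uv_cut=0.99):
    """Apply the first-order doubles criterion only when both metrics pass."""
    q_rs, q_uv = overlap_metrics(min_rs, max_rs, min_uv, max_uv)
    return q_rs > q_rs_cut and q_uv > q_uv_cut


# Invented extrema in kcal/mol, for illustration only.
selected = worth_testing(min_rs=0.0, max_rs=10.0, min_uv=0.0, max_uv=9.95)
skipped = worth_testing(min_rs=0.0, max_rs=10.0, min_uv=0.0, max_uv=5.0)
```

Because only the precalculated extrema are needed, the filter costs a few arithmetic operations per matrix element, which is what makes it cheaper than evaluating the full doubles criterion everywhere.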

[0134] Generally, as is more fully described below, single and double DEE, using either or both of original DEE and 
Goldstein DEE, is run until no further elimination is possible. Usually, convergence is not complete, and further elimi- 
nation must occur to achieve convergence. This is generally done using "super residue" DEE. 
[0135] In a preferred embodiment, additional DEE computation is done by the creation of "super residues" or "unification",
as is generally described in Desmet, Nature 356:539-542 (1992); Desmet, et al., The Protein Folding Problem
and Tertiary Structure Prediction, Ch. 10:1-49 (1994); Goldstein, et al., supra. A super residue is a combination of two
or more variable residue positions which is then treated as a single residue position. The super residue is then evaluated 
in singles DEE, and doubles DEE, with either other residue positions or super residues. The disadvantage of super 
residues is that there are many more rotameric states which must be evaluated; that is, if a first variable residue position 
has 5 possible rotamers, and a second variable residue position has 4 possible rotamers, there are 20 possible super
residue rotamers which must be evaluated. However, these super residues may be eliminated similar to singles, rather 
than being flagged like pairs. 
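
The growth in rotameric states on unification can be sketched as a Cartesian product. This is an illustration only; `unify` is a hypothetical helper, and rotamers are represented by bare indices.

```python
from itertools import product


def unify(rotamers_i, rotamers_j):
    """Combine two positions into one super residue: one state per rotamer pair."""
    return list(product(rotamers_i, rotamers_j))


# A position with 5 rotamers unified with a position with 4 rotamers.
super_rotamers = unify(range(5), range(4))
```

Each tuple in `super_rotamers` is one super rotamer, so a previously flagged pair now corresponds to a single state that singles DEE can eliminate outright.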

[0136] The selection of which positions to combine into super residues may be done in a variety of ways. In general, 
random selection of positions for super residues results in inefficient elimination, but it can be done, although this is
not preferred. In a preferred embodiment, the first criterion in the selection of positions for a super residue is the
number of rotamers at the position. If the position has too many rotamers, it is never unified into a super residue, as 
the computation becomes too unwieldy. Thus, only positions with fewer than about 100,000 rotamers are chosen, with 
less than about 50,000 being preferred and less than about 10,000 being especially preferred. 
[0137] In a preferred embodiment, the evaluation of whether to form a super residue is done as follows. All possible 
rotamer pairs are ranked using Equation 28, and the rotamer pair with the highest number is chosen for unification: 

Equation 28

$$\frac{\text{fraction of flagged pairs}}{\log(\text{number of super rotamers resulting from the potential unification})}$$

[0138] Equation 28 is looking for the pair of positions that has the highest fraction or percentage of flagged pairs but
the fewest number of super rotamers. That is, the pair that gives the highest value for Equation 28 is preferably chosen.
Thus, a pair of positions that has the highest fraction of flagged pairs but also a very large number of super rotamers
(that is, the number of rotamers at position i times the number of rotamers at position j) may not be chosen
(although it could) over a pair with a lower percentage of flagged pairs but fewer super rotamers.
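
A minimal sketch of this ranking is given here. It assumes the denominator of Equation 28 is a logarithm of the super rotamer count (the operator is partly illegible in the source), and the candidate data are invented. The natural logarithm is used; the choice of base does not change the ranking.

```python
import math


def unification_score(flagged_pairs, n_rot_i, n_rot_j):
    """Fraction of flagged pairs divided by log(number of super rotamers)."""
    n_super = n_rot_i * n_rot_j
    if n_super < 2:
        raise ValueError("unification needs more than one super rotamer")
    return (flagged_pairs / n_super) / math.log(n_super)


def best_pair(candidates):
    """candidates maps (i, j) -> (flagged_pairs, n_rot_i, n_rot_j)."""
    return max(candidates, key=lambda pair: unification_score(*candidates[pair]))


# Hypothetical position pairs: (flagged pairs, rotamers at i, rotamers at j).
candidates = {
    (0, 1): (150, 20, 10),     # 75% flagged, 200 super rotamers
    (0, 2): (9000, 100, 100),  # 90% flagged, 10000 super rotamers
    (1, 2): (300, 10, 100),    # 30% flagged, 1000 super rotamers
}
chosen = best_pair(candidates)
```

In this toy case the pair with the highest flagged fraction, (0, 2), loses to (0, 1) because it would create far more super rotamers, which is the trade-off the paragraph above describes.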

[0139] In an alternate preferred embodiment, positions are chosen for super residues that have the highest average 
energy; that is, for positions i and j, the average energy of all rotamers for i and all rotamers for j is calculated, and the 
pair with the highest average energy is chosen as a super residue. 

[0140] Super residues are made one at a time, preferably. After a super residue is chosen, the singles and doubles 
DEE computations are repeated where the super residue is treated as if it were a regular residue. As for singles and 
doubles DEE, the elimination of rotamers in the super residue DEE will alter the minimum energy calculations of DEE. 
Thus, repeating singles and/or doubles DEE can result in further elimination of rotamers.

[0141] Figure 3 is a detailed illustration of the processing operations associated with a ranking module 34 of the
invention. The calculation and storage of the singles and doubles energies 70 is the first step, although these may be
recalculated every time. Step 72 is the optional application of a cutoff, where singles or doubles energies that are too
high are eliminated prior to further processing. Either or both of original singles DEE 74 or Goldstein singles DEE 76
may be done, with the elimination of original singles DEE 74 being generally preferred. Once the singles DEE is run,
[...] super residue DEE. This preferably results in convergence at a global optimum sequence. As
is depicted in Figure 3, after any step any or all of the previous steps can be rerun, in any order.
[0142] The addition of super residue DEE to the computational processing, with repetition of the previous DEE steps,
generally results in convergence at the global optimum. Convergence to the global optimum is guaranteed if no cutoff
applications are made, although generally a global optimum is achieved even with these steps. In a preferred [...]

[0143] In a preferred embodiment, the various DEE steps are run until a manageable number of sequences is found,
i.e. no further processing is required. These sequences represent a set of optimized protein sequences, and they can
be evaluated as is more fully described below. Generally, for computational purposes, [...]



[0144] Alternatively, DEE is run to a point, resulting in a set of optimized sequences (in this context, a set of remainder
sequences) [...]

[0145] In a preferred embodiment, the computation processing need not comprise a DEE computational step. In this
embodiment, a Monte Carlo search is undertaken, as is known in the art. See Metropolis et al., J. Chem. Phys. 21:1087
(1953) [...] is chosen as a start point. In one embodiment, the variable residue positions are classified as core, boundary
or surface residues and the set of available residues at each position is thus defined. [...] A
Monte Carlo search then makes a random jump at one position, either to a different rotamer of the same amino acid
[...] meets the Boltzmann criteria for acceptance, it is used as the starting point for another jump. If the
Boltzmann test fails, another random jump is attempted from the previous sequence. In this way, sequences with lower
and lower energies are found, to generate a set of low energy sequences.

[0146] If computational processing results in a single global optimum sequence, it is frequently preferred to generate
additional sequences in the energy neighborhood of the global solution, which may be ranked. These additional
sequences are also optimized protein sequences. The generation of additional optimized sequences is generally [...]
differences between the theoretical and actual energies of a sequence. Generally, in a preferred [...]
being preferred, at least about 85% homologous being particularly preferred, and at least about 90% being especially
preferred. In some cases, homology as high as 95% to 98% is desirable. Homology in this context means [...]
sequences being compared. Homology in this context includes amino acids which are
identical and those which are similar (functionally equivalent). This homology will be determined using standard
[...] program described by Devereux, et al., Nucl. Acid Res. 12:387-395 (1984),
or the BLASTX program (Altschul, et al., J. Mol. Biol. 215:403-410 (1990)), preferably using the default
settings for either. The alignment may include the introduction of gaps in the sequences to be aligned. [...]
will be determined based on the number of homologous amino acids in relation to the total number
[...] shorter [...]

[0147] Once optimized protein sequences are identified, the processing of Figure 2 optionally proceeds to step 56
[...] The search module 36 is a set of computer code that executes a search strategy. For example, the search module 36 may
be written to execute a Monte Carlo search as described above. Starting with the global solution, [...]
meets the Boltzmann criteria for acceptance, it is used as the starting point for another jump. See Metropolis et al.,






EP 0 974 111 B1 



1953, supra, hereby incorporated by reference. If the Boltzmann test fails, another random jump is attempted from the
previous sequence. A list of the sequences and their energies is maintained during the search. After a predetermined
number of jumps, the best scoring sequences may be output as a rank-ordered list. Preferably, at least about 10^6 jumps
are made, with at least about 10^7 jumps being preferred and at least about 10^8 jumps being particularly preferred.
Preferably, at least about 100 to 1000 sequences are saved, with at least about 10,000 sequences being preferred
and at least about 100,000 to 1,000,000 sequences being especially preferred. During the search, the temperature is
preferably set to 1000 K.

[0148] Once the Monte Carlo search is over, all of the saved sequences are quenched by changing the temperature
to 0 K, and fixing the amino acid identity at each position. Preferably, every possible rotamer jump for that particular 
amino acid at every position is then tried. 
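
The search and quench steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the energy function, the state encoding as (amino acid, rotamer index) tuples, the toy scores and all function names are hypothetical, and the Boltzmann constant is taken in kcal/(mol K).

```python
import math
import random

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)


def accept(delta_e, temperature, rng):
    """Metropolis test: always accept downhill, uphill with Boltzmann odds."""
    if delta_e <= 0.0:
        return True
    if temperature <= 0.0:
        return False
    return rng.random() < math.exp(-delta_e / (K_B * temperature))


def monte_carlo(start, choices, energy, n_jumps, temperature, n_keep, seed=0):
    """Random single-position jumps; returns the n_keep best (sequence, energy)."""
    rng = random.Random(seed)
    current, e_current = list(start), energy(start)
    seen = {tuple(current): e_current}
    for _ in range(n_jumps):
        pos = rng.randrange(len(current))
        trial = list(current)
        trial[pos] = rng.choice(choices[pos])
        e_trial = energy(trial)
        if accept(e_trial - e_current, temperature, rng):
            current, e_current = trial, e_trial
            seen[tuple(current)] = e_current
    return sorted(seen.items(), key=lambda item: item[1])[:n_keep]


def quench(sequence, choices, energy):
    """Zero-temperature pass: fix amino acids, try every rotamer, keep gains."""
    best, e_best = list(sequence), energy(sequence)
    for pos in range(len(best)):
        amino_acid = best[pos][0]
        for state in choices[pos]:
            if state[0] != amino_acid:
                continue
            trial = best[:pos] + [state] + best[pos + 1:]
            e_trial = energy(trial)
            if e_trial < e_best:
                best, e_best = trial, e_trial
    return best, e_best


# Toy problem: each state is (amino acid, rotamer index) with a made-up score.
SCORES = {("A", 0): 1.0, ("V", 0): -2.0, ("V", 1): -3.5,
          ("L", 0): -1.0, ("L", 1): -4.0, ("I", 0): -2.5}
CHOICES = [[("A", 0), ("V", 0), ("V", 1)], [("L", 0), ("L", 1), ("I", 0)]]


def toy_energy(sequence):
    return sum(SCORES[state] for state in sequence)


ranked = monte_carlo([("A", 0), ("L", 0)], CHOICES, toy_energy,
                     n_jumps=2000, temperature=1000.0, n_keep=5)
quenched = [quench(seq, CHOICES, toy_energy) for seq, _ in ranked]
```

The sketch keeps every accepted sequence in memory and sorts at the end; a search of 10^6 or more jumps would more plausibly maintain a bounded list of the best sequences seen so far.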

[0149] The computational processing results in a set of optimized protein sequences. These optimized protein se- 
quences are generally, but not always, significantly different from the wild-type sequence from which the backbone 
was taken. That is, each optimized protein sequence preferably comprises at least about 5-10% variant amino acids 
from the starting or wild-type sequence, with at least about 15-20% changes being preferred and at least about 30% 
changes being particularly preferred. 

[0150] These sequences can be used in a number of ways. In a preferred embodiment, one, some or all of the
optimized protein sequences are constructed into designed proteins, as shown with step 58 of Figure 2. Thereafter, the
protein sequences can be tested, as shown with step 60 of Figure 2. Generally, this can be done in one of two ways.
[0151] In a preferred embodiment, the designed proteins are chemically synthesized as is known in the art. This is
particularly useful when the designed proteins are short, preferably less than 150 amino acids in length, with less than 
100 amino acids being preferred, and less than 50 amino acids being particularly preferred, although as is known in 
the art, longer proteins can be made chemically or enzymatically. 

[0152] In a preferred embodiment, particularly for longer proteins or proteins for which large samples are desired, 
the optimized sequence is used to create a nucleic acid such as DNA which encodes the optimized sequence and 
which can then be cloned into a host cell and expressed. Thus, nucleic acids, and particularly DNA, can be made which
encodes each optimized protein sequence. This is done using well known procedures. The choice of codons, suitable 
expression vectors and suitable host cells will vary depending on a number of factors, and can be easily optimized as 
needed. 

[0153] Once made, the designed proteins are experimentally evaluated and tested for structure, function and stability,
as required. This will be done as is known in the art, and will depend in part on the original protein from which the
protein backbone structure was taken. Preferably, the designed proteins are more stable than the known protein that
was used as the starting point, although in some cases, if some constraints are placed on the methods, the designed
protein may be less stable. Thus, for example, it is possible to fix certain residues for altered biological activity and
find the most stable sequence, but it may still be less stable than the wild type protein. Stable in this context means
that the new protein retains either biological activity or conformation past the point at which the parent molecule did.
Stability includes, but is not limited to, thermal stability, i.e. an increase in the temperature at which reversible or
irreversible denaturing starts to occur; proteolytic stability, i.e. a decrease in the amount of protein which is irreversibly
cleaved in the presence of a particular protease (including autolysis) ; stability to alterations in pH or oxidative conditions; 
chelator stability; stability to metal ions; stability to solvents such as organic solvents, surfactants, formulation chemi- 
cals; etc. 

[0154] In a preferred embodiment, the modelled proteins are at least about 5% more stable than the original protein,
with at least about 10% being preferred and at least about 20-50% being especially preferred.
[0155] The results of the testing operations may be computationally assessed, as shown with step 62 of Figure 2.
An assessment module 38 may be used in this operation. That is, computer code may be prepared to analyze the test
data with respect to any number of metrics.

[0156] At this processing juncture, if the protein is selected (the yes branch at block 64) then the protein is utilized 
(step 66), as discussed below. If a protein is not selected, the accumulated information may be used to alter the ranking 
module 34, and/or step 56 is repeated and more sequences are searched. 

[0157] In a preferred embodiment, the experimental results are used for design feedback and design optimization. 
[0158] Once made, the proteins of the invention find use in a wide variety of applications, as will be appreciated by
those in the art, ranging from industrial to pharmacological uses, depending on the protein. Thus, for example, proteins
and enzymes exhibiting increased thermal stability may be used in industrial processes that are frequently run at
elevated temperatures, for example carbohydrate processing (including saccharification and liquefaction of starch to
produce high fructose corn syrup and other sweeteners), protein processing (for example the use of proteases in laundry
detergents, food processing, feed stock processing, baking, etc.), etc. Similarly, the methods of the present invention 
allow the generation of useful pharmaceutical proteins, such as analogs of known proteinaceous drugs which are more 
thermostable, less proteolytically sensitive, or contain other desirable changes. 

[0159] The following examples serve to more fully describe the manner of using the above-described invention, as 
All references cited herein are explicitly incorporated by reference.

EXAMPLES 

Example 1 

Protein Design Using van der Waals and Atomic Solvation Scoring Functions with DEE 

A cyclical design strategy was developed that couples theory, computation and experimental testing in order
[...] of four components: a design paradigm, a simulation module, experimental testing and data analysis. The design
paradigm is based on the concept of inverse folding [...] (1991)) and consists of the use of a fixed backbone onto which
a sequence of side-chain rotamers can be placed, where rotamers are the allowed conformations of amino acid side
chains (Ponder, et al., [...]). Specific tertiary interactions based on the three dimensional juxtaposition of atoms are
used to determine the [...] best adopt the target fold. Given a [...] as input, the simulation must generate as output a
rank ordered list of solutions [...] function that explicitly considers the atom positions in the various rotamers. The
principal obstacle is that a fixed backbone composed of n residues and m possible rotamers per residue (all rotamers
of all allowed amino acids) results in m^n possible arrangements of the system, an immense number for even small
design problems. For [...] 50 rotamers at 15 positions results in over 10^25 sequences, which at an evaluation rate of
10^9 sequences per second [...] current [...] 10^9 years to exhaustively search for the global minimum. [...]
data for the analysis module. The analysis section discovers correlations between calculable properties of the simulated
[...] module [...] guiding design paradigm. In other words, the cost function used in the simulation
module describes a theoretical potential energy surface whose horizontal axis comprises all possible solutions to the
[...] potential energy surface is not guaranteed to match the actual potential energy surface [...]
cost function in order to create better agreement between the theoretical and actual potential energy surfaces. If such
[...] the output of subsequent simulations will be amino acid sequences that better achieve
the target properties. This design cycle is generally applicable to any protein system and, by removing the subjective
[...]

[...] The PDA side-chain selection algorithm requires as input a backbone structure defining the desired fold. The
task of designing a sequence that takes this fold can be viewed as finding an optimal arrangement of amino acid side
[...]

[...] of the Dead End Elimination (DEE) theorem was developed (Desmet,
et al., (1992) (supra); Desmet, et al., (1994) (supra); Goldstein, (1994) (supra)) to solve the combinatorial search problem.
[...] for a very fast discrete search algorithm that was designed to pack protein side
[...] constraint that a position is limited to the rotamers of a single amino acid. This extension of DEE [...]
described more fully herein. The guarantee that only the global optimum will be found is still valid, and in our extension
means that the globally optimal sequence is found in its optimal conformation.
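
The scale of the combinatorial problem can be checked with a few lines of arithmetic. The sketch below assumes an evaluation rate of 10^9 sequences per second, a figure read from a damaged passage; `exhaustive_search_years` is a hypothetical helper.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(m_rotamers, n_positions, sequences_per_second):
    """Years needed to score all m^n rotamer sequences at a fixed rate."""
    n_sequences = m_rotamers ** n_positions
    return n_sequences / sequences_per_second / SECONDS_PER_YEAR


n_sequences = 50 ** 15                       # 50 rotamers at 15 positions
years = exhaustive_search_years(50, 15, 1e9)  # assumed rate of 10^9 per second
```

Under these assumptions the count exceeds 10^25 and the exhaustive search takes on the order of 10^9 years, which is why an elimination method such as DEE is needed.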

[0164] DEE was implemented with a novel addition to the improvements suggested by Goldstein (Goldstein, (1994)
(supra)). As has been noted, exhaustive application of the R=1 rotamer elimination and R=0 rotamer-pair flagging
equations and limited application of the R=1 rotamer-pair flagging equation routinely fails to find the global solution.
This problem can be overcome by unifying residues into "super residues" (Desmet, et al., (1992) (supra); Desmet, et
al., (1994) (supra); Goldstein, (1994) (supra)). However, unification can cause an unmanageable increase in the number
of super rotamers per super residue position and can lead to intractably slow performance since the computation time
for applying the R=1 rotamer-pair flagging equation increases as the fourth power of the number of rotamers. These
problems are of particular importance for protein design applications given the requirement for large numbers of
rotamers per residue position. In order to limit memory size and to increase performance, we developed a heuristic that
governs which residues (or super residues) get unified and the number of rotamer (or super rotamer) pairs that are
included in the R=1 rotamer-pair flagging equation. A program called PDA_DEE was written that takes a list of rotamer
energies from PDA_SETUP and outputs the global minimum sequence in its optimal conformation with its energy.
[0165] Scoring functions: The rotamer library used was similar to that used by Desmet and coworkers (Desmet,
et al., (1992) (supra)). χ1 and χ2 angle values of rotamers for all amino acids except Met, Arg and Lys were expanded
plus and minus one standard deviation about the mean value from the Ponder and Richards library (supra) in order to
minimize possible errors that might arise from the discreteness of the library. χ3 and χ4 angles that were undetermined
from the database statistics were assigned values of 0° and 180° for Gln and 60°, -60° and 180° for Met, Lys and Arg.
The number of rotamers per amino acid is: Gly, 1; Ala, 1; Val, 9; Ser, 9; Cys, 9; Thr, 9; Leu, 36; Ile, 45; Phe, 36; Tyr,
36; Trp, 54; His, 54; Asp, 27; Asn, 54; Glu, 69; Gln, 90; Met, 21; Lys, 57; Arg, 55. The cyclic amino acid Pro was not
included in the library. Further, all rotamers in the library contained explicit hydrogen atoms. Rotamers were built with
bond lengths and angles from the Dreiding forcefield (Mayo, et al., J. Phys. Chem. 94:8897 (1990)).
[0166] The initial scoring function for sequence arrangements used in the search was an atomic van der Waals 
potential. The van der Waals potential reflects excluded volume and steric packing interactions which are important 
determinants of the specific three dimensional arrangement of protein side chains. A Lennard-Jones 12-6 potential 
with radii and well depth parameters from the Dreiding forcefield was used for van der Waals interactions. Non-bonded 
interactions for atoms connected by one or two bonds were not considered. Van der Waals radii for atoms connected
by three bonds were scaled by 0.5. Rotamer/rotamer pair energies and rotamer/template energies were calculated in 
a manner consistent with the published DEE algorithm (Desmet, et a/., (1992) (supra)). The template consisted of the 
protein backbone and the side chains of residue positions not to be optimized. No intra-side-chain potentials were 
calculated. This scheme scored the packing geometry and eliminated bias from rotamer internal energies. Prior to 
DEE, all rotamers with template interaction energies greater than 25 kcal/mol were eliminated. Also, any rotamer whose 
interaction was greater than 25 kcal/mol with all other rotamers at another residue position was eliminated. A program 
called PDA_SETUP was written that takes as input backbone coordinates, including side chains for positions not
optimized, a rotamer library, a list of positions to be optimized and a list of the amino acids to be considered at each
position. PDA_SETUP outputs a list of rotamer/template and rotamer/rotamer energies. 
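
A minimal sketch of the van der Waals term and the 25 kcal/mol prescreen follows. It assumes the Lennard-Jones 12-6 form written in terms of an equilibrium distance `r0` and well depth `d0`; the parameter values and rotamer labels are placeholders, not Dreiding parameters, and the function names are hypothetical.

```python
def lennard_jones_12_6(r, r0, d0):
    """12-6 energy in kcal/mol; the minimum is -d0 at separation r0."""
    rho = r0 / r
    return d0 * (rho ** 12 - 2.0 * rho ** 6)


def prefilter(template_energies, cutoff=25.0):
    """Drop rotamers whose template interaction energy exceeds the cutoff."""
    return {rotamer: energy for rotamer, energy in template_energies.items()
            if energy <= cutoff}


# Placeholder parameters and energies, for illustration only.
e_at_minimum = lennard_jones_12_6(r=3.9, r0=3.9, d0=0.1)
e_clash = lennard_jones_12_6(r=2.0, r0=3.9, d0=0.1)
kept = prefilter({"Leu_03": -4.2, "Leu_17": 31.0, "Val_02": 25.0})
```

The steep repulsive wall is what makes the cutoff effective: a single close contact drives a rotamer's template energy far above 25 kcal/mol, so clashing rotamers are removed before DEE begins.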

[0167] The pairwise solvation potential was implemented in two components to remain consistent with the DEE 
methodology: rotamer/template and rotamer/rotamer burial. For the rotamer/template buried area, the reference state 
was defined as the rotamer in question at residue i with the backbone atoms only of residues i-1, i and i+1. The area
of the side chain was calculated with the backbone atoms excluding solvent but not counted in the area. The folded
state was defined as the area of the rotamer in question at residue i, but now in the context of the entire template
structure including non-optimized side chains. The rotamer/template buried area is the difference between the
reference and the folded states. The rotamer/rotamer reference area is simply the sum of the areas of the isolated rotamers.
The folded state is the area of the two rotamers placed in their relative positions on the protein scaffold but with no
template atoms present. The Richards definition of solvent accessible surface area (Lee & Richards, 1971, supra) was
used, with a probe radius of 1.4 Å and Dreiding van der Waals radii. Carbon and sulfur, and all attached hydrogens,
were considered nonpolar. Nitrogen and oxygen, and all attached hydrogens, were considered polar. Surface areas
were calculated with the Connolly algorithm using a dot density of 10 Å^-2 (Connolly, (1983) (supra)). In more recent
implementations of PDA_SETUP, the MSEED algorithm of Scheraga has been used in conjunction with the Connolly
algorithm to speed up the calculation (Perrot, et al., J. Comput. Chem. 13:1-11 (1992)).

[0168] Monte Carlo search: Following DEE optimization, a rank ordered list of sequences was generated by a Monte
Carlo search in the neighborhood of the DEE solution. This list of sequences was necessary because of possible
differences between the theoretical and actual potential surfaces. The Monte Carlo search starts at the global minimum
sequence found by DEE. A residue was picked randomly and changed to a random rotamer selected from those allowed
at that site. A new sequence energy was calculated and, if it met the Boltzmann criteria for acceptance, the new sequence
was used as the starting point for another jump. If the Boltzmann test failed, then another random jump was attempted
from the previous sequence. A list of the best sequences found and their energies was maintained throughout the
search. Typically 10^6 jumps were made, 100 sequences saved and the temperature was set to 1000 K. After the search

was over, all of the saved sequences were quenched by changing the temperature to 0 K, fixing the amino acid identity
and trying every possible rotamer jump at every position. The search was implemented in a program called
PDA_MONTE whose input was a global optimum solution from PDA_DEE and a list of rotamer energies from
PDA_SETUP. The output was a list of the best sequences rank ordered by their score.

[0169] PDA_SETUP, PDA_DEE and PDA_MONTE were implemented in the CERIUS2 software development
environment (Biosym/Molecular Simulations, San Diego, CA).

[0170] Model system and experimental testing: The homodimeric coiled coil of α helices was selected as the
initial design target [...] and dimeric tertiary organization ease characterization. Their sequences display a seven
residue periodic HP pattern called a heptad repeat, (a-b-c-d-e-f-g) (Cohen & Parry, Proteins Struc. Func. Genet.
7:1-15 (1990)). The a and d [...] exposed (Figure 5). The backbone needed for input to the simulation module was
taken from the [...] crystallographically determined fixed field of the rest of the protein. Homodimer sequence
symmetry was enforced [...]

[0171] Homodimeric coiled coils were modeled on the backbone coordinates of GCN4-p1, PDB accession code
[...] crystallographically determined positions. The program BIOGRAF (Biosym/Molecular Simulations, San
Diego, CA) was used to generate explicit hydrogens on the structure, which was then conjugate gradient minimized
[...] using the Dreiding forcefield. The HP pattern was enforced by only allowing hydrophobic amino acids into
the rotamer groups for the optimized a and d positions. The hydrophobic group consisted of Ala, Val, Leu, Ile, Met,
Phe, Tyr and Trp rotamers, which resulted in a rotamer set of 238 rotamers per position. Homodimer symmetry was
enforced by penalizing [...] rotamer pairs that violate sequence symmetry. Different rotamers of the same amino acid
were allowed at symmetry related positions. The asparagine that occupies the a position at residue 16 was left in the
template [...] step Monte Carlo search run at a temperature of 1000 K [...] rank ordered by their score. To test
reproducibility, the search was repeated three times with different random number seeds and all trials provided
essentially identical results. The Monte Carlo searches [...] All calculations in this work were performed on a Silicon
Graphics 200 MHz R4400 processor.

[0172] Optimizing the 16 a and d positions, each with 238 possible hydrophobic rotamers, results in 238^16 or 10^38
[...] The DEE algorithm finds the [...] time. The DEE solution matches the naturally occurring GCN4-p1 sequence
of a and d residues for all of the 16 positions. [...] Monte Carlo search run at a temperature of 1000 K [...] list of
sequences rank [...]
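
The figures of 238 rotamers per position and roughly 10^38 rotamer sequences follow directly from the rotamer counts listed for the library. The check below is a sketch; the dictionary simply restates the counts for the eight hydrophobic amino acids.

```python
import math

# Rotamer counts for the hydrophobic group, as listed for the rotamer library.
HYDROPHOBIC_ROTAMERS = {"Ala": 1, "Val": 9, "Leu": 36, "Ile": 45,
                        "Met": 21, "Phe": 36, "Tyr": 36, "Trp": 54}

rotamers_per_position = sum(HYDROPHOBIC_ROTAMERS.values())
n_positions = 16  # the optimized a and d positions
search_space = rotamers_per_position ** n_positions
order_of_magnitude = math.log10(search_space)
```

The sum is 238 and the base-10 logarithm of 238^16 is just above 38, consistent with the stated search space.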

[0173] To test reproducibility, the search was repeated three times with different random number seeds and all trials
provided essentially identical results. The second best sequence is a Val 30 to Ala mutation [...]
Within the top 15 sequences up to six mutations from the ground state sequence
are tolerated, indicating that a variety of packing arrangements are available even for a small coiled coil. Eight
sequences [...]
about 15 kcal/mol higher in energy, the 56th and 70th in the list (Table 1).
TABLE 1

Name     Sequence                            Rank   Energy
PDA-3H   RMKQLEDKVEELLSKNYHLENEVARLKKLVGER      1   -118.1
PDA-3A   RMKQLEDKVEELLSKNYHLENEVARLKKLAGER      2   -115.3
PDA-3G   RMKQLEDKVEELLSKNYHLENEMARLKKLVGER      5   -112.8
PDA-3B   RLKQMEDKVEELLSKNYHLENEVARLKKLVGER      6   -112.6
PDA-3D   RLKQMEDKVEELLSKNYHLENEVARLKKLAGER     13   -109.7
PDA-3C   RMKQWEDKAEELLSKNYHLENEVARLKKLVGER     14   -109.6
PDA-3F   RMKQFEDKVEELLSKNYHLENEVARLKKLVGER     55   -103.9
PDA-3E   RMKQLEDKVEELLSKNYHAENEVARLKKLVGER     70   -103.1

[0174] Thirty-three residue peptides were synthesized on an Applied Biosystems Model 433A peptide synthesizer 
using Fmoc chemistry, HBTU activation and a modified Rink amide resin from Novabiochem. Standard 0.1 mmol
coupling cycles were used and amino termini were acetylated. Peptides were cleaved from the resin by treating
approximately 200 mg of resin with 2 mL trifluoroacetic acid (TFA) and 100 µL water, 100 µL thioanisole, 50 µL ethanedithiol
and 150 mg phenol as scavengers. The peptides were isolated and purified by precipitation and repeated washing 
with cold methyl tert-butyl ether followed by reverse phase HPLC on a Vydac C8 column (25 cm by 22 mm) with a 
linear acetonitrile-water gradient containing 0.1 % TFA. Peptides were then lyophilized and stored at -20 °C until use. 
Plasma desorption mass spectrometry found all molecular weights to be within one unit of the expected masses. 
[0175] Circular dichroism: CD spectra were measured on an Aviv 62DS spectrometer at pH 7.0 in 50 mM phosphate,
150 mM NaCl and 40 µM peptide. A 1 mm pathlength cell was used and the temperature was controlled by a
thermoelectric unit. Thermal melts were performed in the same buffer using two degree temperature increments with an
averaging time of 10 s and an equilibration time of 90 s. Tm values were derived from the ellipticity at 222 nm ([θ]222)
by evaluating the minimum of the d[θ]222/dT^-1 versus T plot (Cantor & Schimmel, Biophysical Chemistry. New York: W.
H. Freeman and Company, 1980). The Tm's were reproducible to within one degree. Peptide concentrations were
determined from the tyrosine absorbance at 275 nm (Huyghues-Despointes, et al., supra).
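
The Tm evaluation can be sketched numerically. The example below assumes temperatures in kelvin, a central finite difference for the derivative with respect to 1/T, and a synthetic two-state melting curve; none of these details or names come from the source.

```python
import math


def melting_temperature(temps_k, theta):
    """Return the temperature at which d(theta)/d(1/T) is at its minimum."""
    best_t, best_slope = None, None
    for i in range(1, len(temps_k) - 1):
        d_theta = theta[i + 1] - theta[i - 1]
        d_inv_t = 1.0 / temps_k[i + 1] - 1.0 / temps_k[i - 1]
        slope = d_theta / d_inv_t
        if best_slope is None or slope < best_slope:
            best_t, best_slope = temps_k[i], slope
    return best_t


# Synthetic melt sampled every two degrees: folded signal -30000, unfolded 0,
# midpoint 320 K. These values are invented for illustration.
temps = [274.0 + 2.0 * k for k in range(48)]
theta = [-30000.0 * (1.0 - 1.0 / (1.0 + math.exp(-(t - 320.0) / 4.0)))
         for t in temps]
tm_estimate = melting_temperature(temps, theta)
```

With two degree increments the estimate can only land on a sampled temperature, so the resolution of the method is bounded by the temperature step.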

[0176] Size exclusion chromatography: Size exclusion chromatography was performed with a Synchropak GPC
100 column (25 cm by 4.6 mm) at pH 7.0 in 50 mM phosphate and 150 mM NaCl at 0 °C. GCN4-p1 and p-LI (Harbury,
et al., Science 262:1401 (1993)) were used as size standards. 10 µL injections of 1 mM peptide solution were
chromatographed at 0.20 mL/min and monitored at 275 nm. Peptide concentrations were approximately 60 µM as estimated
from peak heights. Samples were run in triplicate.

[0177] The designed a and d sequences were synthesized as above using the GCN4-p1 sequence for the b-c and
e-f-g positions. Standard solid phase techniques were used and, following HPLC purification, the identities of the
peptides were confirmed by mass spectrometry. Circular dichroism spectroscopy (CD) was used to assay the secondary
structure and thermal stability of the designed peptides. The CD spectra of all the peptides at 1 °C and a concentration
of 40 µM exhibit minima at 208 and 222 nm and a maximum at 195 nm, which are diagnostic for α helices (data not
shown). The ellipticity values at 222 nm indicate that all of the peptides are >85% helical (approximately -28000 deg
cm²/dmol), with the exception of PDA-3C which is 75% helical at 40 µM but increases to 90% helical at 170 µM (Table 2).
Table 2. CD data and calculated structural properties of the PDA peptides.

Name     [θ]222           Tm     E_MC         ΔA_np   ΔA_p    Vol     Rot
         (deg cm²/dmol)   (°C)   (kcal/mol)   (Å²)    (Å²)    (Å³)    bond
PDA-3H   33000            57     118.1        2967    2341    1830    28
PDA-3A   30300            48     115.3        2910    2361    1725    26
PDA-3B   28200            47     112.6        2977    2372    1830    28
PDA-3G   30700            47     112.8        3003    2383    1878    32
PDA-3F   28800            39     103.9        3000    2336    1872    28
PDA-3D   27800            39     109.7        2920    2392    1725    26
PDA-3C   24100            26     109.6        2878    2400    1843    26
PDA-3E   27500            24     103.1        2882    2361    1674    24

Name     E_CQ         E_CG         E_vdW        Npb    Pb
         (kcal/mol)   (kcal/mol)   (kcal/mol)
PDA-3H   -234         -308         409          207    128
PDA-3A   -232         -312         400          203    128
PDA-3B   -242         -306         379          210    127
PDA-3G   -240         -309         439          212    128
PDA-3F   -188         -302         420          212    128
PDA-3D   -240         -310         370          206    127
PDA-3C   -149         -304         398          215    129
PDA-3E   -179         -309         411          203    127

E_MC is the Monte [...]; ΔA_np and ΔA_p are [...] nonpolar and polar [...], respectively; E_CQ is the electrostatic
energy using equilibrated charges; E_CG is the [...]; Vol is the [...] volume; Rot bonds is the number of side chain
rotatable bonds (excluding methyl rotors); Npb and Pb are the number of buried non-polar and polar atoms, respectively.



[0178] The melting temperatures (Tm's) show a broad range of values (data not shown), with 6 of the 8 peptides
melting at greater than physiological temperature. Also, the Tm's were not correlated to the number of sequence
differences from GCN4-p1. Single amino acid changes resulted in some of the most and least stable peptides,
demonstrating the importance of specificity in sequence selection.

[0179] Size exclusion chromatography confirmed the dimeric nature of these designed peptides. Using coiled coil
peptides of known oligomerization state as standards, the PDA peptides migrated as dimers. This result is consistent
with the β-branched residues at a positions and leucines at d positions, which have been shown previously
to favor dimerization over other possible oligomerization states (Harbury, et al., supra).

[0180] The characterization of the PDA peptides demonstrates the successful design of several stable dimeric helical
coiled coils. The sequences were automatically generated in the context of the design paradigm by the simulation
module [...]. Nuclear magnetic resonance experiments aimed at probing the specificity of the tertiary packing are
[...] for these peptides. Initial experiments show significant protection of amide protons from
chemical exchange and chemical shift dispersion comparable to GCN4-p1 (unpublished results) (Oas, et al.,
Biochemistry 29:2891 (1990); Goodman & Kim, Biochem. 30:11615 (1991)).

[0181] Data analysis and design feedback: A detailed analysis of the correspondence between the theoretical and
experimental potential surfaces, and hence an estimate of the accuracy of the simulation cost function, was enabled
[...]. Using thermal stability as a measure of design performance, melting temperatures
of the PDA peptides were plotted against the sequence scores found in the Monte Carlo search (Figure 6). The
[...] plot shows that while an exclusively van der Waals scoring function can screen for
stable sequences, it does not accurately predict relative stabilities. In order to address this issue, correlations between
calculated structural properties and Tm's were systematically examined using quantitative structure
activity relationships (QSAR), [...] commonly used in structure based drug design (Hopfinger, J. Med.
Chem. [...] (1985)).

[0182] Table 2 lists various molecular properties of the PDA peptides in addition to the van der Waals based Monte
Carlo scores and the experimentally determined Tm's. A wide range of properties was examined, including molecular
mechanics components, such as electrostatic energies, and geometric measures, such as volume. The goal of QSAR
is the generation of equations that closely approximate the experimental quantity, in this case Tm, as a function of the
calculated properties. Such equations suggest which properties can be used in an improved cost function. The PDA
analysis module employs genetic function approximation (GFA) (Rogers & Hopfinger, J. Chem. Inf. Comput. Sci. 34:
854 (1994)), a novel method to optimize QSAR equations that selects which properties are to be included and the
relative weightings of the properties using a genetic algorithm. GFA accomplishes an efficient search of the space of
possible equations and robustly generates a list of equations ranked by their correlation to the data.
[0183] Equations are scored by lack of fit (LOF), a weighted least square error measure that resists overfitting by
penalizing equations with more terms (Rogers & Hopfinger, supra). GFA optimizes both the length and the composition
of the equations and, by generating a set of QSAR equations, clarifies combinations of properties that fit well and
properties that recur in many equations. All of the top five equations that correct the simulation energy (E_MC) contain
burial of nonpolar surface area, ΔA_np (Table 3).
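
To make the penalty for additional terms concrete, the sketch below computes a lack-of-fit score in the form commonly quoted for Friedman's LOF measure, LOF = LSE / (1 - (c + d·p)/M)², where c is the number of terms, p the number of properties used, d a smoothing parameter and M the number of data points. This is an illustration under stated assumptions rather than the scoring code of the analysis module: taking LSE as the mean squared residual, the default smoothing value, and the function name are all assumptions.

```python
def lack_of_fit(residuals, n_terms, n_features, smoothing=1.0):
    """Penalized least-squares score; lower is better.

    residuals  -- (observed - predicted) for each data point
    n_terms    -- number of terms in the equation (c)
    n_features -- number of properties appearing in the equation (p)
    smoothing  -- penalty weight per property (d); the default is an assumption
    """
    m = len(residuals)
    lse = sum(r * r for r in residuals) / m  # mean squared residual (assumed definition)
    complexity = n_terms + smoothing * n_features
    if complexity >= m:
        raise ValueError("equation has too many terms for the amount of data")
    return lse / (1.0 - complexity / m) ** 2


# Same residuals, scored as a two-term and as a three-term equation.
example_residuals = [1.0, -1.0] * 4
two_term_score = lack_of_fit(example_residuals, n_terms=2, n_features=2)
three_term_score = lack_of_fit(example_residuals, n_terms=3, n_features=3)
```

For identical residuals the three-term equation receives the worse (larger) score, so an extra term is only favored when it reduces the error by more than the penalty it incurs. With only eight peptides the penalty grows quickly with equation length.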

Table 3. Top five QSAR equations generated by GFA with LOF, correlation coefficient and cross
validation scores.

QSAR equation                              LOF     r²    CV r²
-1.44*E_MC + 0.11*ΔA_np - 0.73*Npb         16.23   .98   .78
-1.78*E_MC + 0.20*ΔA_np - 2.43*Rot         23.13   .97   .75
-1.59*E_MC + 0.17*ΔA_np - 0.05*Vol         24.57   .97   .36
-1.54*E_MC + 0.11*ΔA_np                    25.45   .91   .80
-1.60*E_MC + 0.09*ΔA_np - 0.12*ΔA_p        33.88   .96   .90

ΔA_np and ΔA_p are nonpolar and polar surface buried upon folding, respectively. Vol is side chain
volume, Npb is the number of buried nonpolar atoms and Rot is the number of buried rotatable bonds.



[0184] The presence of ΔA_np in all of the top equations, in addition to the low LOF of the QSAR containing only E_MC
and ΔA_np, strongly implicates nonpolar surface burial as a critical property for predicting peptide stability. This conclusion
is not surprising given the role of the hydrophobic effect in protein energetics (Dill, Biochem. 29:7133 (1990)).
[0185] Properties were calculated using BIOGRAF and the Dreiding forcefield. Solvent accessible surface areas
were calculated with the Connolly algorithm (Connolly, (1983) (supra)) using a probe radius of 1.4 Å and a dot density
of 10 Å⁻². Volumes were calculated as the sum of the van der Waals volumes of the side chains that were optimized.
The numbers of buried polar and nonpolar heavy atoms were defined as atoms, with their attached hydrogens, that
expose less than 5 Å² in the surface area calculation. Electrostatic energies were calculated using a dielectric of one
(1 - [...]

[...] the E_MC/ΔA_np equation could not be expected
to fit the data as smoothly as QSAR's with three terms [...] cross validated. However, all other
two term QSAR's had LOF scores [...].

These results justify the [...]

shown to perform well (van Gunsteren, et al., [...]).
[0189] ΔA_np and ΔA_p were introduced into the simulation module to correct the cost function. Contributions of
[...] cal/mol/Å² favoring nonpolar area burial and 86 cal/mol/Å² opposing polar area burial
[...]

[...] characterized by Sauer and coworkers [...] were taken from
PDB file 1LMB (Beamer & Pabo, J. Mol. Biol. 227:177 (1992)). [...] designated chain 4 in the PDB file was
removed from the context of the rest of the structure ([...] and DNA) and, using BIOGRAF, explicit
hydrogens were added. The hydrophobic residues within 5 Å of the three mutation sites ([...]
V47) are Y22, L31, A37, [...] residues greater than 80% buried
except for M42 which is 65% buried and L64 which is 2% buried. A37 has only one possible rotamer and hence was
not optimized. The other nine residues in the 5 Å sphere were allowed to take any rotamer conformation of their amino
acid ("floated"). The mutation sites were allowed any rotamer of the amino acid sequence in question. Depending on
the mutant sequence, 5 x 10¹⁶ to 7 x 10¹⁸ conformations were possible. Rotamer energy and DEE calculation times
were 2 to 4 minutes. The combined activity score is that of Hellinga and Richards (Hellinga, et al., (1994) (supra)).
Seventy-eight of the 125 possible combinations were generated. Also, this dataset has been used to test several
computational schemes and can serve as a basis for comparing different forcefields (Lee & Levitt, Nature 352:448
(1991); van Gunsteren & Mark, supra; Hellinga, et al., (1994) (supra)). The simulation module, using the cost function
found by QSAR, was used to find the optimal conformation and energy for each mutant sequence. All hydrophobic
residues within 5 Å of the three mutation sites were also left free to be relaxed by the algorithm. This 5 Å sphere
contained 12 residues, a significantly larger problem than previous efforts (Lee & Levitt, supra; Hellinga, (1994) (supra)),
that were rapidly optimized by the DEE component of the simulation module. The rank correlation of the predicted
energy to the combined activity score proposed by Hellinga and Richards is shown in Figure 7. The wildtype has the
lowest energy of the 125 possible sequences and the correlation is essentially equivalent to previously published results,
which demonstrates that the QSAR corrected cost function is not specific for coiled coils and can model other proteins
adequately.
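
A rank correlation of the kind referred to above can be sketched as follows. The text does not state which rank statistic was used; this example assumes Spearman's coefficient, computed as the Pearson correlation of average ranks so that tied activity scores are handled. All function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation (assumed here; the source does not name the statistic)."""
    return pearson(average_ranks(x), average_ranks(y))
```

In use, `x` would hold the predicted energies of the mutant sequences and `y` their combined activity scores; because lower energy is expected to accompany higher activity, a good cost function would give a strongly negative coefficient under this convention.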

Example 2 

Automated design of the surface positions of protein helices 

[0191] GCN4-p1, a homodimeric coiled coil, was again selected as the model system because it can be readily
synthesized by solid phase techniques and its helical secondary structure and dimeric tertiary organization ease
characterization. The sequences of homodimeric coiled coils display a seven residue periodic hydrophobic and polar pattern
called a heptad repeat, (a b c d e f g) (Cohen & Parry, supra). The a and d positions are buried at the dimer interface
and are usually hydrophobic, whereas the b, c, e, f, and g positions are solvent exposed and usually polar (Figure 5).
Examination of the crystal structure of GCN4-p1 (O'Shea, et al., supra) shows that the b, c, and f side chains extend
into solvent and expose at least 55% of their surface area. In contrast, the e and g residues bury from 50 to 90% of
their surface area by packing against the a and d residues of the opposing helix. We selected the 12 b, c, and f residue
positions for surface sequence design: positions 3, 4, 7, 10, 11, 14, 17, 18, 21, 24, 25, and 28 using the numbering
from PDB entry 2zta (Bernstein, et al., J. Mol. Biol. 112:535 (1977)). The remainder of the protein structure, including
all other side chains and the backbone, was used as the template for sequence selection calculations. The symmetry
of the dimer and lack of interactions of surface residues between the subunits allowed independent design of each
subunit, thereby significantly reducing the size of the sequence optimization problem.

[0192] All possible sequences of hydrophilic amino acids (D, E, N, Q, K, R, S, T, A, and H) for the 12 surface positions
were screened by our design algorithm. The torsional flexibility of the amino acid side chains was accounted for by
considering a discrete set of all allowed conformers of each side chain, called rotamers (Ponder, et al., (1987) (supra);
Dunbrack, et al., Struc. Biol. Vol. 1(5):334-340 (1994)). Optimizing the 12 b, c, and f positions each with 10 possible
amino acids results in 10¹² possible sequences, which corresponds to 10²⁸ rotamer sequences when using the
Dunbrack and Karplus backbone-dependent rotamer library. The immense search problem presented by rotamer sequence
optimization is overcome by application of the Dead-End Elimination (DEE) theorem (Desmet, et al., (1992) (supra);
Desmet, et al., (1994) (supra); Goldstein, (1994) (supra)). Our implementation of the DEE theorem extends its utility
to sequence design and rapidly finds the globally optimal sequence in its optimal conformation.
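
The pruning idea behind DEE can be illustrated with a small sketch. It implements only the original elimination criterion, in which rotamer r at position i is discarded when its best possible energy, E(i_r) + Σ_j min_s E(i_r, j_s), exceeds the worst possible energy of some alternative t at the same position, E(i_t) + Σ_j max_s E(i_t, j_s). It is not the implementation described in this example, which extends DEE to sequence design; the data layout, function names and toy energies are assumptions made for illustration.

```python
def pair_energy(pair_e, i, r, j, s):
    """Pair energy between rotamer r at position i and rotamer s at position j.

    pair_e maps (i, j) with i < j to a matrix indexed [rotamer at i][rotamer at j].
    """
    if i < j:
        return pair_e[(i, j)][r][s]
    return pair_e[(j, i)][s][r]


def dead_end_elimination(self_e, pair_e):
    """Return, for each position, the set of rotamers that survive pruning.

    self_e[i][r] is the rotamer/template energy of rotamer r at position i.
    """
    n = len(self_e)
    alive = [set(range(len(rotamers))) for rotamers in self_e]
    changed = True
    while changed:  # repeat until a full pass eliminates nothing
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in sorted(alive[i]):
                # Best case for r: most favorable surviving partner at every other position.
                best_r = self_e[i][r] + sum(
                    min(pair_energy(pair_e, i, r, j, s) for s in alive[j]) for j in others)
                for t in sorted(alive[i]):
                    if t == r:
                        continue
                    # Worst case for t: least favorable surviving partner everywhere.
                    worst_t = self_e[i][t] + sum(
                        max(pair_energy(pair_e, i, t, j, s) for s in alive[j]) for j in others)
                    if best_r > worst_t:
                        alive[i].discard(r)  # r cannot be part of the global optimum
                        changed = True
                        break
    return alive


# Toy problem: two positions with two rotamers each (illustrative energies only).
toy_self = [[0.0, 10.0], [0.0, 1.0]]
toy_pair = {(0, 1): [[0.0, 1.0], [1.0, 0.0]]}
surviving = dead_end_elimination(toy_self, toy_pair)
```

Because a rotamer is removed only when it is provably worse than an alternative in every context, the globally optimal combination is never eliminated. That is what allows a search space on the order of 10²⁸ rotamer sequences to be reduced without sampling.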
[0193] We examined three potential-energy functions for their effectiveness in scoring surface sequences. Each
candidate scoring function was used to design the b, c, and f positions of the model coiled coil and the resulting peptide
was synthesized and characterized to assess design performance. A hydrogen-bond potential was used to check if
predicted hydrogen bonds can contribute to designed protein stability, as expected from studies of hydrogen bonding
in proteins and peptides (Stickle, et al., supra; Huyghues-Despointes, et al., supra). Optimizing sequences for hydrogen
bonding, however, often buries polar protons that are not involved in hydrogen bonds. This uncompensated loss of
potential hydrogen-bond donors to water prompted examination of a second scoring scheme consisting of a
hydrogen-bond potential in conjunction with a penalty for burial of polar protons (Eisenberg, (1986) (supra)). We tested a third
scoring scheme which augments the hydrogen bond potential with the empirically derived helix propensities of Baldwin
and coworkers (Chakrabartty, et al., supra). Although the physical basis of helix propensities is unclear, they can have
a significant effect on protein stability and can potentially be used to improve protein designs (O'Neil & DeGrado, 1990;
Zhang, et al., Biochem. 30:2012 (1991); Blaber, et al., Science 260:1637 (1993); O'Shea, et al., 1993; Villegas, et al.,
Folding and Design 1:29 (1996)). A van der Waals potential was used in all cases to account for packing interactions
and excluded volume.

[0194] Several other sequences for the b, c and f positions were also synthesized and characterized to help discern
the relative importance of the hydrogen-bonding and helix-propensity potentials. The sequence designed with the
hydrogen-bond potential was randomly scrambled, thereby disrupting the designed interactions but not changing the
helix propensity of the sequence. Also, the sequence with the maximum possible helix propensity, all positions set to
alanine, was made. Finally, to serve as undesigned controls, the naturally occurring GCN4-p1 sequence and a
sequence randomly selected from the hydrophilic amino acid set were synthesized and studied.

[0195] Sequence design: scoring functions and DEE: The protein structure was modeled on the backbone
coordinates of GCN4-p1, PDB record 2zta (Bernstein, et al., supra; O'Shea, et al., supra). Atoms of all side chains not
optimized were left in their crystallographically determined positions. The program BIOGRAF (Molecular Simulations
Incorporated, San Diego, CA) was used to generate explicit hydrogens on the structure, which was then conjugate
gradient minimized [...] using the DREIDING forcefield (Mayo, et al., 1990, supra). The symmetry of the dimer
and lack of interactions of surface residues between the subunits allowed independent design of each subunit. All
computations were done using the first monomer to appear in 2zta (chain A). A backbone-dependent rotamer library
was used (Dunbrack, et al. (1993) (supra)). χ3 angles that were undetermined from the database statistics were
assigned the following values: Arg, -60°, 60°, and 180°; Gln, -120°, -60°, 0°, 60°, 120°, and 180°; Glu, 0°, 60°, and 120°;
Lys, -60°, 60° and 180°. χ4 angles that were undetermined from the database statistics were assigned the following
values: Arg, -120°, -60°, 60°, 120°, and 180°; Lys, -60°, 60°, and 180°. Rotamers with combinations of χ3 and χ4 that
resulted in sequential g+/g- or g-/g+ angles were eliminated. Uncharged His rotamers were used. A Lennard-Jones
[...] by Caltech scientists, new press release (1997)) was used for van der Waals interactions. The hydrogen bond potential
consisted of a [...]. [...] hydrogen
bond potential is based on the potential used in DREIDING, with more restrictive angle-dependent terms to [...].
The [...] on the hybridization [...] of
the donor and acceptor, as shown in Equations 10 to 13, above.

[...] θ is the donor-hydrogen-acceptor angle, φ is the hydrogen-acceptor-base angle (the base
is the atom attached to the acceptor, for example the carbonyl carbon is the base for a carbonyl oxygen acceptor), and
ϕ is the angle between the normals of the planes defined by the six atoms attached to the sp² centers (the supplement
[...]). The hydrogen-bond function is only evaluated when 2.6 Å < R < 3.2 Å, θ > 90°,
φ - 109.5° < 90° for the sp³ donor-sp³ acceptor case, and φ > 90° for the sp³ donor-sp² acceptor case; no switching
[...]. Template donors and acceptors that were involved in template-template hydrogen bonds were
not included in the donor and acceptor lists. For the purpose of exclusion, a template-template hydrogen bond was
considered to exist when [...]

[...] was only applied to buried polar hydrogens not involved in hydrogen bonds, where a hydrogen bond was considered
[...] kcal/mol. This [...] was not applied to template hydrogens. The hydrogen-bond
potential was also supplemented with a weak coulombic term that included a distance-dependent dielectric constant
[...] were only applied to polar functional groups. A net
formal charge of +1 was used for Arg and Lys and a net formal charge of -1 was used for Asp and Glu. Energies
associated with α-helical propensities were calculated using Equation 14, above. In Equation 14, E is the energy of
α-helical propensity, [...]
free energy of helix propagation of alanine used as a standard, and Nss is the propensity scale factor, which was set
to 3.0. This potential was selected in order to scale the propensity energies to a similar [...] as
[...] 12 processor, R10000-based Silicon Graphics Power Challenge or [...]

[0198] [...] and CD analysis was as in Example 1. NMR samples were prepared in
90/10 H2O/D2O and 50 mM sodium phosphate buffer at pH 7.0. Spectra were acquired on a Varian Unityplus 600 MHz
spectrometer at 25 °C. 32 transients were acquired with 1.5 seconds of solvent presaturation [...].
Samples were 1 mM. Size exclusion chromatography was performed with a [...]
at pH 7.0 in 50 mM phosphate and 150 mM NaCl [...]. [...] 5 μl injections of 1 mM peptide solution were
chromatographed at 0.50 ml/min and monitored at 214 nm. Samples were run in triplicate.

[0199] The surface sequences of all of the peptides examined in this study are shown in Table 4. 
Table 4. Sequences and properties of the synthesized peptides

Peptide   Design method   Surface sequence    Tm     ΣΔG°         N
                          bcf bcf bcf bcf     (°C)   (kcal/mol)
GCN4-p1   none            KQD EES YHN ARK     57      3.831       2
6A        HB              EKD RER RRE RRE     71      2.193       2
6B        HB + PB         EKQ KER ERE ERQ     72      2.868       2
6C        HB + HP         ARA AAA RRR ARA     69     -2.041       2
6D        scrambled HB    REE RRR EDR KRE     71      2.193       2
6E        random polar    NTR AKS ANH NTQ     15      4.954       2
6F        poly(Ala)       AAA AAA AAA AAA     73     -3.096       4

For clarity only the designed surface residues are shown and they are grouped by position (b, c,
and f). The sequence numbers of the designed positions are: 3, 4, 7, 10, 11, 14, 17, 18, 21, 24,
25, and 28. Melting temperatures (Tm's) were determined by circular dichroism and
oligomerization states (N) were determined by size exclusion chromatography. ΣΔG° is the sum
of the standard free energy of helix propagation of the 12 b, c, and f positions (Chakrabartty, et
al., 1994). Abbreviations for design methods are: hydrogen bonds (HB), polar hydrogen burial
penalty (PB), and helix propensity (HP).



[0200] Sequence 6A, designed with a hydrogen-bond potential, has a preponderance of Arg and Glu residues that
are predicted to form numerous hydrogen bonds to each other. These long chain amino acids are favored because
they can extend across turns of the helix to interact with each other and with the backbone. When the optimal geometry
of the scrambled 6A sequence, 6D, was found with DEE, far fewer hydrogen bonding interactions were present and
its score was much worse than 6A's. 6B, designed with a polar hydrogen burial penalty in addition to a hydrogen-bond
potential, is still dominated by long residues such as Lys, Glu and Gln but has fewer Arg. Because Arg has more polar
hydrogens than the other amino acids, it more often buries nonhydrogen-bonded protons and therefore is disfavored
when using this potential function. 6C was designed with a hydrogen-bond potential and helix propensity in the scoring
function and consists entirely of Ala and Arg residues, the amino acids with the highest helix propensities (Chakrabartty,
et al., supra). The Arg residues form hydrogen bonds with Glu residues at nearby e and g positions. The random
hydrophilic sequence, 6E, possesses no hydrogen bonds and scores very poorly with all of the potential functions used.
[0201] The secondary structures and thermal stabilities of the peptides were assessed by circular dichroism (CD)
spectroscopy. The CD spectra of the peptides at 1 °C and 40 μM are characteristic of α helices, with minima at 208
and 222 nm, except for the random surface sequence peptide 6E. 6E has a spectrum suggestive of a mixture of α
helix and random coil with a [θ]222 of -12000 deg cm²/dmol, while all the other peptides are greater than 90% helical
with [θ]222 of less than -30000 deg cm²/dmol. The melting temperatures (Tm's) of the designed peptides are 12-16 °C
higher than the Tm of GCN4-p1, with the exception of 6E which has a Tm of 15 °C. CD spectra taken before and after
melts were identical, indicating reversible thermal denaturation. The redesign of surface positions of this coiled coil
produces structures that are much more stable than wildtype GCN4-p1, while a random hydrophilic sequence largely
disrupts the peptide's stability.

[0202] Size exclusion chromatography (SEC) showed that all the peptides were dimers except for 6F, the all Ala 
surface sequence, which migrated as a tetramer. These data show that surface redesign did not change the tertiary 
structure of these peptides, in contrast to some core redesigns (Harbury, et al., supra). In addition, nuclear magnetic
resonance (NMR) spectra of the peptides at 1 mM showed chemical shift dispersion similar to GCN4-p1 (data not 
shown). 

[0203] Peptide 6A, designed with a hydrogen-bond potential, melts at 71 °C versus 57 °C for GCN4-p1,
demonstrating that rational design of surface residues can produce structures that are markedly more stable than naturally
occurring coiled coils. This gain in stability is probably not due to improved hydrogen bonding since 6D, which has the
same surface amino acid composition as 6A but a scrambled sequence and no predicted hydrogen bonds, also melts
at 71 °C. Further, 6B was designed with a different scoring function and has a different sequence and set of predicted
hydrogen bonds but a very similar Tm of 72 °C.

[0204] An alternative explanation for the increased stability of these sequences relative to GCN4-p1 is their higher
helix propensity. The long polar residues selected by the hydrogen bond potential, Lys, Glu, Arg and Gln, are also
among the best helix formers (Chakrabartty, et al., supra). Since the effect of helix propensity is not as dependent on
[...] as that of hydrogen bonding, especially far from the helix ends, little [...] would be expected from
scrambling the sequence of 6A. A rough measure of the helix propensity of the surface sequences, the sum of the
standard free energies of helix propagation (ΣΔG°) (Chakrabartty, et al., supra), corresponds to the peptides' thermal
stabilities [...]

[0205] Peptide 6C was designed with helix propensity as part of the scoring function and it has a ΣΔG° of -2.041
kcal/mol. Though 6C is more stable than GCN4-p1, its Tm of 69 °C is lower than that of 6A and 6B, in spite of 6C's
higher helix propensity. Similarly, 6F has [...] a ΣΔG° of
-3.096 kcal/mol, but its Tm of 73 °C is only marginally higher than that of 6A or 6B. 6F also migrates as a tetramer
during SEC, not a dimer, likely because its poly(Ala) surface exposes a large hydrophobic patch that could mediate
association. Though the results for 6C and 6F support the conclusion that helix propensity is important for surface
design, they point out possible limitations in using propensity exclusively. Increasing propensity does not necessarily
confer the greatest stability on a structure, perhaps because other factors are being affected unfavorably. Also, as is
evident from 6F, changes in the tertiary structure of the protein can occur.
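
The relationship between summed helix propagation free energy and thermal stability discussed above can be checked directly against the values listed in Table 4. The sketch below computes a Pearson correlation coefficient between ΣΔG° and Tm for the seven peptides; it is an illustrative calculation rather than an analysis reported in this example, and the function name is an assumption.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Values transcribed from Table 4, in the order GCN4-p1, 6A, 6B, 6C, 6D, 6E, 6F.
sum_dg = [3.831, 2.193, 2.868, -2.041, 2.193, 4.954, -3.096]  # kcal/mol
tm = [57, 71, 72, 69, 71, 15, 73]                              # deg C

r_all = pearson(sum_dg, tm)
# Melting temperature gain of each designed peptide relative to GCN4-p1.
tm_gain = [t - tm[0] for t in tm[1:]]
```

A negative coefficient is expected, since a lower (more favorable) ΣΔG° accompanies a higher Tm. With only seven peptides, and with 6E far from the rest, the coefficient should be read as a rough indicator. That reading is consistent with the observation that 6C and 6F are not the most stabilized in proportion to their propensity.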

[0206] The characterization of these peptides clearly shows that surface residues have a dramatic impact on the
stability of [...] the random hydrophilic sequence (Tm 15 °C) and the designed sequences [...]
([...] et al., 1991; Minor, et al., (1994) (supra); [...]
1995)). Further, these designs have significantly higher Tm's than the wildtype GCN4-p1 sequence, demonstrating
that surface residues can be used to improve stability in protein design (O'Shea, et al., supra). Though helix propensity
appears to be more important than hydrogen bonding in stabilizing the designed coiled coils, hydrogen bonding may
be important in the design and stabilization of other types of secondary structure.

Example 3 

Design of a protein containing surface and boundary residues using van der Waals, H-bonding, secondary
structure and solvation scoring functions

[0207] In this example, core, boundary and surface residue work was combined. In selecting a motif to test the
integration of our design methodologies, we sought [...]
and experimentally tractable, yet large enough to form an independently folded structure in the absence of disulfide
bonds or metal binding sites. We chose the ββα motif [...]
recent work by Imperiali and coworkers, who designed a 23 residue peptide containing an unusual amino acid (D[...]
(PDB) (Bernstein, et al., 1977) was examined for high resolution structures of the ββα motif and the second zinc finger
[...] Zif268 (PDB 1zaa) was selected as [...].
[...] of the second module aligns very closely with the other two zinc fingers in Zif268 and
with zinc fingers in other proteins and is therefore representative of this fold class. 28 residues were taken from the
crystal structure starting at lysine 33 in the numbering of PDB entry 1zaa, which corresponds to our position 1. The first
12 residues comprise the β sheet with a tight turn at the 6th and 7th positions. Two residues connect the sheet to the
helix, which extends through position 26 and is capped by the last two residues.

[0208] In order to assign the residue positions in the template structure into core, surface or boundary classes, the
[...] be assigned unambiguously to the core while six residues
[...] were classified as boundary. Three of these residues are from the sheet (positions
3, 5, and 12) and four are from the helix (positions 18, 21, 22, and 25). One of the zinc binding residues of Zif268 is in
the core and two are in the boundary, but the fourth, position 8, has a Cα-Cβ [...]
geometric center and is therefore classified as a surface position. The other surface positions considered by the design
[...] helix ends. The remaining exposed positions, which either were in turns, had irregular backbone dihedrals or were
partially buried, were not included in the sequence selection for this initial study. As in our previous studies, the amino
acids considered at the core positions during sequence selection were A, V, L, I, F, Y, and W; the amino acids considered
at the surface positions were A, S, T, H, D, N, E, Q, K, and R; and the combined core and surface amino acid sets (16
amino acids) were considered at the boundary positions.

[0209] In total, 20 out of 28 positions of the template were optimized during sequence selection. The algorithm first
selects Gly for all positions with φ angles greater than 0° in order to minimize backbone strain (residues 9 and 27). The
18 remaining residues were split into two sets and optimized separately to speed the calculation. One set contained
the 1 core, the 6 boundary positions and position 8, which resulted in 1.2 x 10⁹ possible amino acid sequences
corresponding to 4.3 x 10¹⁹ rotamer sequences. The other set contained the remaining 10 surface residues, which had 10¹⁰
possible amino acid sequences and 4.1 x 10²³ rotamer sequences. The two groups do not interact strongly with each
other, making their sequence optimizations mutually independent, though there are strong interactions within each
group. Each optimization was carried out with the non-optimized positions in the template set to the crystallographic
coordinates.
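
The sequence counts quoted above follow directly from the amino acid sets assigned to each residue class. The short calculation below reproduces them; the variable and function names are illustrative, and position 8 is counted with the surface set, as stated for its classification.

```python
from math import prod

CORE = "AVLIFYW"        # amino acids considered at core positions
SURFACE = "ASTHDNEQKR"  # amino acids considered at surface positions
BOUNDARY = sorted(set(CORE) | set(SURFACE))  # combined set; Ala is shared


def sequence_count(position_classes):
    """Number of amino acid sequences for a list of residue classes."""
    sizes = {"core": len(CORE), "surface": len(SURFACE), "boundary": len(BOUNDARY)}
    return prod(sizes[c] for c in position_classes)


# First set: 1 core position, 6 boundary positions, and position 8 (surface).
set_one = ["core"] + ["boundary"] * 6 + ["surface"]
# Second set: the remaining 10 surface positions.
set_two = ["surface"] * 10

count_one = sequence_count(set_one)  # 7 * 16**6 * 10
count_two = sequence_count(set_two)  # 10**10
```

The first product, 7 x 16⁶ x 10, is about 1.2 x 10⁹ and the second is exactly 10¹⁰, matching the figures in the text. The much larger rotamer sequence counts depend on the number of rotamers per amino acid in the library and are not reproduced here.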

[0210] The optimal sequences found from the two calculations were combined and are shown in Figure 8 aligned
with the sequence from the second zinc finger of Zif268. Even though all of the hydrophilic amino acids were considered
at each of the boundary positions, only nonpolar amino acids were selected. The calculated seven core and boundary
positions form a well-packed buried cluster. The Phe side chains selected by the algorithm at the zinc binding His
positions, 21 and 25, are 80% buried and the Ala at 5 is 100% buried, while the Lys at 8 is greater than 60% exposed
to solvent. The other boundary positions demonstrate the strong steric constraints on buried residues by packing similar
side chains in an arrangement similar to Zif268. The calculated optimal configuration buried 830 Å² of nonpolar
surface area, with Phe 12 (96% buried) and Leu 18 (88% buried) anchoring the cluster. On the helix surface, the
algorithm positions Asn 14 as a helix N-cap with a hydrogen bond between its side-chain carbonyl oxygen and the
backbone amide proton of residue 16. The six charged residues on the helix form three pairs of hydrogen bonds,
though in our coiled coil designs helical surface hydrogen bonds appeared to be less important than the overall helix
propensity of the sequence. Positions 4 and 11 on the exposed sheet surface were selected to be Thr, one of the best
β-sheet forming residues (Kim & Berg, 1993; Minor, et al., (1994) (supra); Smith, et al., (1995) (supra)).
[0211] Combining the 20 designed positions with the Zif268 amino acids at the remaining 8 sites results in a peptide
with overall 39% (11/28) homology to Zif268, which reduces to 15% (3/20) homology when only the designed positions
are considered. A BLAST (Altschul, et al., 1990) search of the non-redundant protein sequence database of the National
Center for Biotechnology Information finds weak homology, less than 40%, to several zinc finger proteins and fragments
of other unrelated proteins. None of the alignments had significance values less than 0.26. By objectively selecting 20
out of 28 residues on the Zif268 template, a peptide with little homology to known proteins and no zinc binding site
was designed.

[0212] Experimental characterization: The far UV circular dichroism (CD) spectrum of the designed molecule,
pda8d, shows a maximum at 195 nm and minima at 218 nm and 208 nm, which is indicative of a folded structure. The
thermal melt is weakly cooperative, with an inflection point at 39 °C, and is completely reversible. The broad melt is
consistent with a low enthalpy of folding, which is expected for a motif with a small hydrophobic core. This behavior
contrasts with the uncooperative transitions observed for other short peptides (Weiss & Keutmann, 1990; Scholtz, et al.,
PNAS USA 88:2854 (1991); Struthers, et al., J. Am. Chem. Soc. 118:3073 (1996b)).

[0213] Sedimentation equilibrium studies at 100 µM and both 7 °C and 25 °C give a molecular mass of 3490, in good
agreement with the calculated mass of 3362, indicating the peptide is monomeric. At concentrations greater than 500
µM, however, the data do not fit well to an ideal single species model. When the data were fit to a monomer-dimer-
tetramer model, dissociation constants of 0.5 - 1.5 mM for monomer-to-dimer and greater than 4 mM for dimer-to-
tetramer were found, though the interaction was too weak to accurately measure these values. Diffusion coefficient
measurements using the water-sLED pulse sequence (Altieri, et al., 1995) agreed with the sedimentation results: at
100 µM pda8d has a diffusion coefficient close to that of a monomeric zinc finger control, while at 1.5 mM the diffusion
coefficient is similar to that of protein G β1, a 56 residue protein. The CD spectrum of pda8d is concentration independent
from 10 µM to 2.6 mM. NMR COSY spectra taken at 2.1 mM and 100 µM were almost identical, with 5 of the Hα-HN
crosspeaks shifted no more than 0.1 ppm and the rest of the crosspeaks remaining unchanged. These data indicate
that pda8d undergoes a weak association at high concentration, but this association has essentially no effect on the
peptide's structure.
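
To illustrate why a dissociation constant in the 0.5 - 1.5 mM range is consistent with a mostly monomeric peptide at 100 µM, the following sketch solves a two-state monomer-dimer equilibrium. It is a simplification made here for illustration: the tetramer term of the fitted model is ignored, and the function name and units are assumptions rather than anything taken from the original analysis.

```python
import math


def monomer_fraction(total_conc, kd):
    """Fraction of peptide chains that are monomeric in a monomer-dimer equilibrium.

    total_conc: total peptide concentration, counted in monomer units
    kd: dimer dissociation constant, Kd = [M]^2 / [D], in the same units
    Mass balance: total_conc = [M] + 2[D], so 2[M]^2/Kd + [M] - total_conc = 0.
    """
    if total_conc <= 0 or kd <= 0:
        raise ValueError("concentration and Kd must be positive")
    # Positive root of the quadratic in [M].
    monomer = kd * (math.sqrt(1.0 + 8.0 * total_conc / kd) - 1.0) / 4.0
    return monomer / total_conc


# Example with concentrations in micromolar and an assumed Kd of 1 mM.
low_conc = monomer_fraction(100.0, 1000.0)    # dilute sample
high_conc = monomer_fraction(1500.0, 1000.0)  # concentrated sample
```

Under these assumptions the dilute sample is predominantly monomeric while the concentrated one is not, which matches the qualitative picture given by the sedimentation and diffusion data.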

[0214] The NMR chemical shifts of pda8d are well dispersed, suggesting that the protein is folded and well-ordered.
The Hα-HN fingerprint region of the TOCSY spectrum is well-resolved with no overlapping resonances (Figure 9A)
and all of the Hα and HN resonances have been assigned. NMR data were collected on a Varian Unityplus 600 MHz
spectrometer equipped with a Nalorac inverse probe with a self-shielded z-gradient. NMR samples were prepared in
90/10 H2O/D2O or 99.9% D2O with 50 mM sodium phosphate at pH 5.0. Sample pH was adjusted using a glass
electrode with no correction for the effect of D2O on measured pH. All spectra for assignments were collected at 7 °C.
Sample concentration was approximately 2 mM. NMR assignments were based on standard homonuclear methods
using DQF-COSY, NOESY and TOCSY spectra (Wuthrich, NMR of Proteins and Nucleic Acids (John Wiley & Sons
[...])). NOESY and TOCSY spectra were acquired with 2K points in F2 and 512 increments in F1, and
DQF-COSY spectra were acquired with 4K points in F2 and 1024 increments in F1. All spectra were acquired with [...]
transients. NOESY spectra were recorded with mixing times of [...] ms, and
TOCSY spectra were recorded with an isotropic mixing time of 80 ms. In TOCSY and DQF-COSY spectra, water
suppression was achieved by presaturation during a relaxation delay of [...]. Water suppression in
the NOESY spectra was accomplished with the WATERGATE pulse sequence [...]. Chemical shifts
were referenced to the HOD resonance. Spectra were zero-filled in both F2 and F1 [...] sine bell (NOESY and
TOCSY) [...].

[0215] Water-sLED experiments (Altieri, et al., 1995) were run at 25 °C at 1.5 mM, 400 µM and 100 µM in 99.9%
D2O with 50 mM sodium phosphate at pH 5.0. Axial gradient field strength was varied from 3.26 to [...]
(Altieri, et al., 1995). Diffusion coefficients were 1.48 x 10^[...], 1.62 x 10^[...] and 1.73 x [...].

[0216] All unambiguous sequential and medium-range NOEs are shown in Figure 9A. Hα-HN and/or HN-HN NOEs [...]



Table 5.

NMR structure determination of pda8d: distance restraints, structural statistics, atomic root-mean-square (rms)
deviations, and comparison to the design target. <SA> are the 32 simulated annealing structures, SA is the average
structure and SD is the standard deviation. The design target is the backbone of Zif268.

Distance restraints
  Intraresidue                                    148
  Sequential                                       94
  Short range (|i-j| = 2-5 residues)               78
  Long range (|i-j| > 5 residues)                  34
  Total                                           354

Structural statistics                             <SA> ± SD
  Rms deviation from distance restraints (Å)      0.049 ± 0.004
  Rms deviation from idealized geometry
    Bonds (Å)                                     0.0051 ± 0.0004
    Angles (degrees)                              0.76 ± 0.04
    Impropers (degrees)                           0.56 ± 0.04

Atomic rms deviations (Å)*                        <SA> vs. SA ± SD
  Backbone                                        0.55 ± 0.03
  Backbone + nonpolar side chains                 1.05 ± 0.06
  Heavy atoms                                     1.25 ± 0.04

Atomic rms deviations between pda8d
and the design target (Å)*                        SA vs. target
  Backbone                                        1.04
  Heavy atoms                                     2.15

*Atomic rms deviations are for residues 3 to 26, inclusive. The termini, residues 1, 2, 27, and 28, were highly
disordered and had very few non-sequential or non-intraresidue contacts.



[0218] The NMR solution structure of pda8d shows that it folds into a ββα motif with well-defined secondary structure
elements and tertiary organization which match the design target. A direct comparison of the design template, the
backbone of the second zinc finger of Zif268, to the pda8d solution structure highlights their similarity (data not shown).
Alignment of the pda8d backbone to the design target is excellent, with an atomic rms deviation of 1.04 Å (Table 5).
Pda8d and the design target correspond throughout their entire structures, including the turns connecting the secondary
structure elements.
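
For readers unfamiliar with the rms deviation figures quoted above, the sketch below shows the basic calculation for two sets of paired atomic coordinates. It assumes the two structures have already been superimposed in a common frame; the optimal superposition step that normally precedes this calculation is deliberately left out, and the function name is hypothetical.

```python
import math


def rms_deviation(coords_a, coords_b):
    """Root-mean-square deviation between paired atoms.

    coords_a, coords_b: sequences of (x, y, z) tuples listing the same atoms
    in the same order, already expressed in the same coordinate frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy example: every atom displaced by exactly 1 unit along z.
toy = rms_deviation([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)])
```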

[0219] In conclusion, the experimental characterization of pda8d shows that it is folded and well-ordered with a
weakly cooperative thermal transition, and that its structure is an excellent match to the design target. To our knowledge,
pda8d is the shortest sequence of naturally occurring amino acids that folds to a unique structure without metal binding,
oligomerization or disulfide bond formation (McKnight, et al., Nature Struc. Biol. 4:180 (1996)). The successful design
of pda8d supports the use of objective, quantitative sequence selection algorithms for protein design. This robustness
suggests that the program can be used to design sequences for de novo backbones.

Example 4 

Protein design using a scaled van der Waals scoring function in the core region 

[0220] An ideal model system to study core packing is the β1 immunoglobulin-binding domain of streptococcal protein
G (Gβ1) (Gronenborn, et al., Science 253:657 (1991); Alexander, et al., Biochem. 31:3597 (1992); Barchi, et al., Protein
Sci. 3:15 (1994); Gallagher, et al., 1994; Kuszewski, et al., 1994; Orban, et al., 1995). Its small size, 56 residues,
renders computations and experiments tractable. Perhaps most critical for a core packing study, Gβ1 contains no
disulfide bonds and does not require a cofactor or metal ion to fold. Further, Gβ1 contains sheet, helix and turn structures
and is without the repetitive side-chain packing patterns found in coiled coils or some helical bundles. This lack of
periodicity reduces the bias from a particular secondary or tertiary structure and necessitates the use of an objective
side-chain selection program to examine packing effects.

[0221] Sequence positions that constitute the core were chosen by examining the side-chain solvent accessible
surface area of Gβ1. Any side chain exposing less than 10% of its surface was considered buried. Eleven residues
meet this criterion, with seven from the β sheet (positions 3, 5, 7, 20, 43, 52 and 54), three from the helix (positions 26,
30, and 34) and one in an irregular secondary structure (position 39). These positions form a contiguous core. The
remainder of the protein structure, including all other side chains and the backbone, was used as the template for
sequence selection calculations at the eleven core positions.
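
The burial criterion above reduces to a threshold on fractional side-chain exposure. A minimal sketch follows, assuming the fractional exposures have already been computed by some surface-area program. The exposure values in the example are invented for illustration and are not measurements of Gβ1.

```python
def buried_positions(fraction_exposed, cutoff=0.10):
    """Return the sorted positions whose side chain exposes less than `cutoff`
    of its surface (the 10% criterion described in the text).

    fraction_exposed: dict mapping residue position -> exposed fraction (0 to 1)
    """
    return sorted(pos for pos, frac in fraction_exposed.items() if frac < cutoff)


# Hypothetical exposure values, for illustration only.
example_exposure = {3: 0.02, 4: 0.45, 5: 0.00, 6: 0.30, 7: 0.08}
core = buried_positions(example_exposure)
```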

[0222] All possible core sequences consisting of alanine, valine, leucine, isoleucine, phenylalanine, tyrosine or
tryptophan (A, V, L, I, F, Y or W) were considered. Our rotamer library was similar to that used by Desmet and coworkers
(Desmet, et al., (1992) (supra)). Optimizing the sequence of the core of Gβ1 with 217 possible hydrophobic rotamers
at all 11 positions results in 217¹¹, or 5 × 10²⁵, rotamer sequences. Our scoring function consisted of two components:
a van der Waals energy term and an atomic solvation term favoring burial of hydrophobic surface area. The van der
Waals radii of all atoms in the simulation were scaled by a factor α (Eqn. 3) to change the importance of packing effects.
Radii were not scaled for the buried surface area calculations. By predicting core sequences with various radii scalings
and then experimentally characterizing the resulting proteins, a rigorous study of the importance of packing effects on
protein design is possible.

[0223] The protein structure was modeled on the backbone coordinates of Gβ1, PDB record 1pga (Bernstein, et al.,
supra; Gallagher, et al., 1994). Atoms of all side chains not optimized were left in their [...]
positions. The program BIOGRAF (Molecular Simulations Incorporated, San Diego, CA) was used to generate
hydrogens on the structure, which was then conjugate gradient minimized for 50 steps using the Dreiding force field
(Mayo, et al., 1990, supra). The rotamer library, DEE optimization and Monte Carlo search were as outlined above. A
Lennard-Jones 12-6 potential was used for van der Waals interactions, with atomic radii scaled for the various cases
as discussed herein. The Richards definition of solvent-accessible surface area [...] was used, and areas were
calculated with the Connolly algorithm (Connolly, (1983) (supra)). An atomic solvation parameter, derived
from previous work, of [...] was used to favor hydrophobic burial and to penalize solvent exposure.

[...] nonpolar exposure [...] our optimization framework, first consider the total hydrophobic area
[...]. This exposure is decreased by the area buried in rotamer/template contacts and
the sum of the areas buried in pairwise rotamer/rotamer contacts.

Optimal sequences for various values of the radius scaling factor α were found using the Dead-End
Elimination theorem (Table 6). Optimal sequences, and their corresponding proteins, are named by the radius scale
factor used in their design. For example, the sequence designed with a radius scale factor of α = 0.90 is called α90.
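
Two quantities from the setup above can be checked numerically: the size of the rotamer search space and the effect of radius scaling on a 12-6 potential. The sketch below is illustrative only. It assumes the common form E = D0[(R0/r)¹² - 2(R0/r)⁶] with the equilibrium separation R0 multiplied by α. Eqn. 3 is not reproduced in this section, so that form, the function names and the parameter values in the example are all assumptions.

```python
def rotamer_search_space(rotamers_per_position, n_positions):
    """Number of rotamer sequences when every position draws from the same set."""
    return rotamers_per_position ** n_positions


def scaled_lj_energy(r, r0, d0, alpha=1.0):
    """12-6 Lennard-Jones energy with the equilibrium separation scaled by alpha.

    r: interatomic distance; r0: unscaled equilibrium separation;
    d0: well depth; alpha: radius scale factor (1.0 means unscaled).
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    ratio = (alpha * r0) / r
    return d0 * (ratio ** 12 - 2.0 * ratio ** 6)


# 217 rotamers at each of 11 core positions, as stated in the text.
n_sequences = rotamer_search_space(217, 11)

# Hypothetical close contact: r0 = 4.0, d0 = 0.1, atoms 3.2 apart.
unscaled = scaled_lj_energy(3.2, 4.0, 0.1, alpha=1.0)
softened = scaled_lj_energy(3.2, 4.0, 0.1, alpha=0.9)
```

With these made-up parameters the same contact is far less repulsive at α = 0.9 than at α = 1.0, which is the sense in which reducing α weakens packing constraints.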

Table 6.

               Gβ1 sequence
               TYR  LEU  LEU  ALA  ALA  PHE  ALA  VAL  TRP  PHE  VAL
α       vol      3    5    7   20   26   30   34   39   43   52   54

0.70    1.28   TRP  TYR  ILE  ILE  PHE  TRP  LEU  ILE  PHE  LEU  ILE
0.75    1.23   PHE  ILE  PHE  ILE  VAL  TRP  VAL  LEU   |    |   ILE
0.80    1.13   PHE   |   ILE   |    |    |   ILE  ILE   |   TRP  ILE
0.85    1.15   PHE   |   ILE   |    |    |   LEU  ILE   |   TRP  PHE
0.90    1.01   PHE   |   ILE   |    |    |    |   ILE   |    |    |
0.95    1.01   PHE   |   ILE   |    |    |    |   ILE   |    |    |
1.0     0.99   PHE   |   VAL   |    |    |    |   ILE   |    |    |
1.05    0.93   PHE   |   ALA   |    |    |    |    |    |    |    |
1.075   0.83   ALA  ALA  ILE   |    |   ILE   |    |    |   ILE  ILE
1.10    0.77   ALA   |   ALA   |    |   ALA   |    |    |   ILE  ILE
1.15    0.68   ALA  ALA  ALA   |    |   ALA   |    |    |   LEU   |

The Gβ1 sequence and position numbers are shown at the top. vol is the fraction of core side-chain
volume relative to the Gβ1 sequence. A vertical bar indicates identity with the Gβ1 sequence.

[...] designed with α = 1.0 and hence serves as a baseline for [...] incorporation of steric effects. The
[...] occurring sequence was used in the side-chain selection algorithm. Variation of α from 0.90 to 1.05 caused little change
in the optimal sequence, demonstrating the algorithm's robustness to minor parameter perturbations. Further, the
packing arrangements predicted with α = 0.90 - 1.05 closely match Gβ1, with average χ angle differences of only 4° from
the crystal structure. The high identity and conformational similarity to Gβ1 imply that, when packing constraints are
used, backbone conformation strongly determines a single family of well packed core designs. Nevertheless, the
constraints on core packing were being modulated by α, as demonstrated by Monte Carlo searches for other low energy
sequences. Several alternate sequences and packing arrangements are in the twenty best sequences found by the
Monte Carlo procedure when α = 0.90. These alternate sequences score much worse when α = 0.95, and when α =
1.0 or 1.05 only strictly conservative packing geometries have low energies. Therefore, α = 1.05 and α = 0.90 define
the high and low ends, respectively, of a range where packing specificity dominates sequence design.
[0227] For α < 0.90, the role of packing is reduced enough to let the hydrophobic surface potential begin to dominate,
thereby increasing the size of the residues selected for the core (Table 6). A significant change in the optimal sequence
appears between α = 0.90 and 0.85, with both α85 and α80 containing three additional mutations relative to α90. Also,
α85 and α80 have a 15% increase in total side-chain volume relative to Gβ1. As α drops below 0.80 an additional 10%
increase in side-chain volume and numerous mutations occur, showing that packing constraints have been
overwhelmed by the drive to bury nonpolar surface. Though the jumps in volume and shifts in packing arrangement appear
to occur suddenly for the optimal sequences, examination of the suboptimal low energy sequences by Monte Carlo
sampling demonstrates that the changes are not abrupt. For example, the α85 optimal sequence is the 11th best
sequence when α = 0.90, and similarly, the α90 optimal sequence is the 9th best sequence when α = 0.85.
[0228] For α > 1.05 atomic van der Waals repulsions are so severe that most amino acids cannot find any allowed
packing arrangements, resulting in the selection of alanine for many positions. This stringency is likely an artifact of
the large atomic radii and does not reflect increased packing specificity accurately. Rather, α = 1.05 is the upper limit
for the usable range of van der Waals scales within our modeling framework.

[0229] Experimental characterization of core designs. Variation of the van der Waals scale factor α results in four
regimes of packing specificity: regime 1 where 0.9 ≤ α ≤ 1.05 and packing constraints dominate the sequence selection;
regime 2 where 0.8 ≤ α < 0.9 and the hydrophobic solvation potential begins to compete with packing forces; regime
3 where α < 0.8 and hydrophobic solvation dominates the design; and regime 4 where α > 1.05 and van der Waals
repulsions appear to be too severe to allow meaningful sequence selection. Sequences that are optimal designs were
selected from each of the regimes for synthesis and characterization. They are α90 from regime 1, α85 from regime
2, α70 from regime 3 and α107 from regime 4. For each of these sequences, the calculated amino acid identities of
the eleven core positions are shown in Table 6; the remainder of the protein sequence matches Gβ1. The goal was to
study the relation between the degree of packing specificity used in the core design and the extent of native-like
character in the resulting proteins.
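
The four regimes are defined purely by ranges of the scale factor, so they can be written as a small lookup. The function below is a sketch with a hypothetical name. The boundaries are the ones stated in the paragraph above.

```python
def packing_regime(alpha):
    """Map a van der Waals scale factor to the regime number used in the text.

    1: 0.9 <= alpha <= 1.05  (packing constraints dominate)
    2: 0.8 <= alpha < 0.9    (solvation begins to compete with packing)
    3: alpha < 0.8           (hydrophobic solvation dominates)
    4: alpha > 1.05          (repulsions too severe for meaningful selection)
    """
    if alpha > 1.05:
        return 4
    if alpha >= 0.9:
        return 1
    if alpha >= 0.8:
        return 2
    return 3


# The four synthesized designs and the regimes they were drawn from.
regimes = {name: packing_regime(a)
           for name, a in [("α90", 0.90), ("α85", 0.85), ("α70", 0.70), ("α107", 1.075)]}
```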

[0230] Peptide synthesis and purification. With the exception of the eleven core positions designed by the
sequence selection algorithm, the sequences synthesized match Protein Data Bank entry 1pga. Peptides were
synthesized using standard Fmoc chemistry, and were purified by reverse-phase HPLC. Matrix assisted laser desorption
mass spectrometry found all molecular weights to be within one unit of the expected masses.
[0231] CD and fluorescence spectroscopy and size exclusion chromatography. The solution conditions for all
experiments were 50 mM sodium phosphate buffer at pH 5.5 and 25 °C unless noted. Circular dichroism spectra were
acquired on an Aviv 62DS spectrometer equipped with a thermoelectric unit. Peptide concentration was approximately
20 µM. Thermal melts were monitored at 218 nm using 2° increments with an equilibration time of 120 s. Tm's were
defined as the maxima of the derivative of the melting curve. Reversibility for each of the proteins was confirmed by
comparing room temperature CD spectra from before and after heating. Guanidinium chloride denaturation
measurements followed published methods (Pace, Methods Enzymol. 131:266 (1986)). Protein concentrations were
determined by UV spectrophotometry. Fluorescence experiments were performed on a Hitachi F-4500 in a 1 cm pathlength
cell. Both peptide and ANS concentrations were 50 µM. The excitation wavelength was 370 nm and emission was
monitored from 400 to 600 nm. Size exclusion chromatography was performed with a PolyLC hydroxyethyl A column
at pH 5.5 in 50 mM sodium phosphate at 0 °C. Ribonuclease A, carbonic anhydrase and Gβ1 were used as molecular
weight standards. Peptide concentrations during the separation were approximately 15 µM as estimated from peak heights
monitored at 275 nm.
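
The definition of Tm as the maximum of the derivative of the melting curve can be sketched with simple central differences. This is an illustrative implementation, not the authors' analysis code. The melting curve below is synthetic (a smooth sigmoid sampled every 2 degrees), and the absolute value of the slope is used so that the sign convention of the CD signal does not matter.

```python
import math


def melting_temperature(temps, signal):
    """Temperature at which |d(signal)/dT| is largest, by central differences.

    temps: increasing temperatures; signal: CD signal at each temperature.
    End points are skipped because they lack a neighbor on one side.
    """
    if len(temps) != len(signal) or len(temps) < 3:
        raise ValueError("need at least three paired points")
    best_t, best_slope = None, -1.0
    for i in range(1, len(temps) - 1):
        slope = abs((signal[i + 1] - signal[i - 1]) / (temps[i + 1] - temps[i - 1]))
        if slope > best_slope:
            best_t, best_slope = temps[i], slope
    return best_t


# Synthetic curve, for illustration only: midpoint placed at 83 degrees.
temps = list(range(25, 100, 2))
signal = [-1.0 / (1.0 + math.exp((t - 83.0) / 3.0)) for t in temps]
tm_estimate = melting_temperature(temps, signal)
```

On real, noisy data some smoothing before differentiation would usually be needed; that step is omitted here.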

[0232] Nuclear magnetic resonance spectroscopy. Samples were prepared in 90/10 H2O/D2O and 50 mM sodium
phosphate buffer at pH 5.5. Spectra were acquired on a Varian Unityplus 600 MHz spectrometer at 25 °C. Samples
were approximately 1 mM, except for α70, which had limited solubility (100 µM). For hydrogen exchange studies, an
NMR sample was prepared, the pH was adjusted to 5.5 and a spectrum was acquired to serve as an unexchanged
reference. This sample was lyophilized, reconstituted in D2O and repetitive acquisition of spectra was begun
immediately at a rate of 75 s per spectrum. Data acquisition continued for 20 hours, then the sample was heated to 99 °C
for three minutes to fully exchange all protons. After cooling to 25 °C, a final spectrum was acquired to serve as the
fully exchanged reference. The areas of all exchangeable amide peaks were normalized by a set of non-exchanging
aliphatic peaks. pH values, uncorrected for isotope effects, were measured for all the samples after data acquisition
[...] to correct for minor differences in pH (Rohl, et al., Biochem. 31:1263 (1992)).

[0233] α90 and α85 have ellipticities and spectra very similar to Gβ1 (not shown), suggesting that their secondary
structure content is comparable [...]. Thermal melts monitored by CD are shown in Figure 10B. α85 and α90 both have
cooperative transitions with melting temperatures (Tm's) of 83 °C and 92 °C, respectively. α107 shows no thermal
transition, behavior expected from a fully unfolded polypeptide, and α70 has a broad, shallow transition, centered at
[...] folded structures. Relative to Gβ1, which has a Tm of 87 °C (Alexander, et al., supra), [...] chemical
denaturation [...].
[0234] α90 has a larger ΔGu than that reported for Gβ1 (Alexander, et al., supra), while α85 is slightly less stable.
It was not possible to measure ΔGu for α70 or α107 because they lack discernible transitions.
[0235] The extent of chemical shift dispersion in the proton NMR spectrum of each protein was assessed to gain
[...] hallmark of a well-ordered native protein. α85 has diminished chemical shift dispersion and peaks that are somewhat
[...], suggesting a moderately mobile structure that nevertheless maintains a distinct fold. α70's
NMR spectrum has almost no dispersion. The broad peaks are indicative of a collapsed but disordered and fluctuating
[...] spectra. Measuring the average number of unexchanged amide protons as a function of time for each of the designed
proteins resulted as follows (data not shown): α90 protects 13 protons for over 20 hours of exchange at pH 5.5 and
25 °C. The α90 exchange curve is indistinguishable from Gβ1's (not shown). α85 also maintains a well-protected set
[...] of ordered native-like proteins. The number of protected protons, however, is
only about half that of α90. The difference is likely due to higher flexibility in some parts of the α85 structure. In
[...] α107 were fully [...].

[0236] Near UV CD spectra and the extent of 8-anilino-1-naphthalene sulfonic acid (ANS) binding were used to
assess the structural ordering of the proteins. The near UV CD spectra of α85 and α90 [...] expected
for proteins with aromatic residues fixed in a unique tertiary structure, while α70 and α107 have featureless spectra
[...] aromatic residues, as [...] non-native collapsed states or unfolded proteins. α70
[...] well, as indicated by a [...] intensity increase and blue shift of the ANS emission spectrum. This
strong binding suggests that α70 possesses a loosely packed or partially exposed cluster of hydrophobic residues
accessible to ANS. ANS binds α85 weakly, with only a 25% increase in emission intensity, similar to the association
seen for some native proteins (Semisotnov, et al., Biopolymers [...]) [...]
fluorescence. All of the proteins migrated as monomers during size exclusion chromatography.

[...] α90 is a well-packed protein by all criteria, and it is more stable than the naturally
[...], possibly because of increased hydrophobic surface burial. α85 is also a stable, ordered
protein, albeit with greater motional flexibility than α90, as evidenced by its NMR spectrum and hydrogen exchange
behavior. α70 has all the features of a disordered collapsed globule: a non-cooperative thermal transition, no NMR
spectral dispersion or amide proton protection, reduced secondary structure content and strong ANS binding. α107 is
a completely unfolded chain, likely due to its lack of large hydrophobic residues to hold [...] together. The clear
trend is a loss of protein ordering as α decreases below 0.90.

[...] In regime 1, with 0.9 ≤ α ≤ 1.05, [...] is dominated by packing specificity, resulting in well-ordered proteins.
In regime 2, with 0.8 ≤ α < 0.9, packing forces are weakened enough to let the hydrophobic force drive larger residues
into the core, [...] packed protein with increased structural motion. In regime 3, α < 0.8, packing
forces are reduced to such an extent that the hydrophobic force dominates, resulting in a fluctuating, partially
[...]. [...] reduced specificity can be used to design protein cores with alternative packings.
[...] benefits of reduced packing constraints, protein cores should be designed with the [...]

[...] phenylalanine, respectively, compared to alanine and valine in α90, force W43 to expose 91 Å² of nonpolar surface
compared to 19 Å² in α90. The hydrophobic driving force this exposure represents seems likely to stabilize alternate
conformations that bury W43 and thereby could contribute to α85's conformational flexibility (Dill, 1985; Onuchic, et
al., 1996). In contrast to the other core positions, a residue at position 43 can be mostly exposed or mostly buried
depending on its side-chain conformation. We designate positions with this characteristic as boundary positions, which
pose a difficult problem for protein design because of their potential to either strongly interact with the protein's core
or with solvent.

[0240] A scoring function that penalizes the exposure of hydrophobic surface area might assist in the design of
boundary residues. Dill and coworkers used an exposure penalty to improve protein designs in a theoretical study
(Sun, et al., Protein Eng. 8(12):1205-1213 (1995)).

[0241] A nonpolar exposure penalty would favor packing arrangements that either bury large side chains in the core
or replace the exposed amino acid with a smaller or more polar one. We implemented a side-chain nonpolar exposure
penalty in our optimization framework and used a penalizing solvation parameter with the same magnitude as the
hydrophobic burial parameter.
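
A surface-area solvation term with an exposure penalty can be sketched as a reward proportional to buried nonpolar area plus a penalty proportional to exposed nonpolar area. The function below is an assumption-laden illustration: the linear form, the function name and the unit parameter value are chosen here for clarity, and the actual solvation parameter is not reproduced. Only the two W43 exposure values, 91 Å² and 19 Å², come from the text.

```python
def solvation_energy(buried_area, exposed_area, burial_param, penalty_param=None):
    """Surface-area solvation score (lower is better).

    buried_area: nonpolar area buried; exposed_area: nonpolar area left exposed
    burial_param: energy per unit area rewarded for burial
    penalty_param: energy per unit area charged for exposure; defaults to the
        same magnitude as burial_param, as described in the text
    """
    if penalty_param is None:
        penalty_param = burial_param
    return -burial_param * buried_area + penalty_param * exposed_area


# Same (hypothetical) buried area, different W43 exposure: 91 vs 19 square angstroms.
# The parameter value 1.0 is a placeholder, not the value used in the design program.
with_penalty = solvation_energy(100.0, 91.0, 1.0) - solvation_energy(100.0, 19.0, 1.0)
without_penalty = (solvation_energy(100.0, 91.0, 1.0, penalty_param=0.0)
                   - solvation_energy(100.0, 19.0, 1.0, penalty_param=0.0))
```

Without the penalty the two arrangements score identically under this term; with it, the arrangement exposing more nonpolar surface is disfavored in proportion to the extra exposed area.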

[0242] The results of adding a hydrophobic surface exposure penalty to our scoring function are shown in Table 7.
[0243] Table 7 depicts the 15 best sequences for the core positions of Gβ1 using α = 0.85 without an exposure
penalty. A_np is the exposed nonpolar surface area in Å².

Table 8 depicts the 15 best sequences of the core positions of Gβ1 using α = 0.85 with an exposure penalty.

[0246] A_np is the exposed nonpolar surface area in Å².
[0247] This sequence, α85W43V, replaces W43 with a valine but is otherwise identical to α85. Though the 8th and
14th sequences also have a smaller side chain at position 43, additional changes in their sequences relative to α85
would complicate interpretation of the effect of the boundary position change. Also, α85W43V has a significantly
different packing arrangement compared to Gβ1, with 7 out of 11 positions altered, but only an 8% increase in side-chain
volume. Hence, α85W43V is a test of the tolerance of this fold to a different, but nearly volume conserving, core. The
far UV CD spectrum of α85W43V is very similar to that of Gβ1, with an ellipticity at 218 nm of -14000 deg cm²/dmol.
While the secondary structure content of α85W43V is native-like, its Tm is 65 °C, nearly 20 °C lower than α85. In
contrast to α85W43V's decreased stability, its NMR spectrum has greater chemical shift dispersion than α85 (data not
shown). The amide hydrogen exchange kinetics show a well protected set of about four protons after 20 hours (data
not shown). This faster exchange relative to α85 is explained by α85W43V's significantly lower stability (Mayo &
Baldwin, 1993). α85W43V appears to have improved structural specificity at the expense of stability, a phenomenon
observed previously in coiled coils (Harbury, et al., 1993). By using an exposure penalty, the design algorithm produced
a protein with greater native-like character.

[0248] We have quantitatively defined the role of packing specificity in protein design and have provided practical 
bounds for the role of steric forces in our protein design program. This study differs from previous work because of the 
use of an objective, quantitative program to vary packing forces during design, which allows us to readily apply our 
conclusions to different protein systems. Further, by using the minimum effective level of steric forces, we were able 
to design a wider variety of packing arrangements that were compatible with the given fold. Finally, we have identified 
a difficulty in the design of side chains that lie at the boundary between the core and the surface of a protein and we 
have implemented a nonpolar surface exposure penalty in our sequence design scoring function that addresses this 
problem.

Example 5

Design of a full protein 

[0249] The entire amino acid sequence of a protein motif has been computed. As in Example 4 the second zinc
finger module of the DNA binding protein Zif268 was selected as the design template. In order to assign the residue
positions in the template structure into core, surface or boundary classes, the orientation of the Cα-Cβ vectors was
assessed relative to a solvent accessible surface computed using only the template Cα atoms. A solvent accessible
surface for only the Cα atoms of the target fold was generated using the Connolly algorithm with a probe radius of 8.0
Å, a dot density of 10 Å⁻², and a Cα radius of 1.95 Å. A residue was classified as a core position if the distance from
its Cα, along its Cα-Cβ vector, to the solvent accessible surface was greater than 5 Å, and if the distance from its Cβ
to the nearest surface point was greater than 2.0 Å. The remaining residues were classified as surface positions if the
sum of the distances from their Cα, along their Cα-Cβ vector, to the solvent accessible surface plus the distance from
[...]. [...] The classifications for Zif268 were used as computed except that positions 1, 17 and 23 were converted from [...]
tertiary structure and inaccuracies in the assignment [...].
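
The geometric classification described above can be sketched as a pair of threshold tests. The core thresholds (5 Å and 2.0 Å) are those given in the text. The cutoff for the surface test is not legible in this section, so it is left as a required argument, and both the "less than" direction of that test and the treatment of everything else as boundary are assumptions made here for illustration.

```python
def classify_position(dist_ca_along_vector, dist_cb_to_surface, surface_sum_cutoff):
    """Classify a residue position as 'core', 'surface' or 'boundary'.

    dist_ca_along_vector: distance (angstroms) from C-alpha, along the
        C-alpha -> C-beta vector, to the solvent accessible surface
    dist_cb_to_surface: distance (angstroms) from C-beta to the nearest surface point
    surface_sum_cutoff: hypothetical cutoff on the sum of the two distances;
        the value used in the original work is not reproduced here
    """
    # Core rule as stated in the text.
    if dist_ca_along_vector > 5.0 and dist_cb_to_surface > 2.0:
        return "core"
    # Assumed surface rule: small summed distance means the side chain points outward.
    if dist_ca_along_vector + dist_cb_to_surface < surface_sum_cutoff:
        return "surface"
    # Assumed: whatever is neither core nor surface is treated as boundary.
    return "boundary"


# Illustrative distances with a placeholder cutoff of 3.0 angstroms.
labels = [classify_position(6.0, 2.5, 3.0),
          classify_position(1.0, 0.5, 3.0),
          classify_position(4.0, 1.5, 3.0)]
```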

[...] are in the boundary or core, one residue, position 8, has a Cα-Cβ vector directed away from [...].
[...] the combined core and surface amino acid sets (16 amino acids) were
considered at the boundary positions. Two of the residue positions (9 and 27) have φ angles [...] and were
set to Gly by the sequence selection algorithm to minimize backbone strain.

[0251] The total number of amino acid sequences that must be considered by the design algorithm is the product of
the number of possible amino acid types at each residue position. The ββα motif residue classification [...]

[0252] The optimal sequence, shown in Figure 11, is called Full Sequence Design-1 (FSD-1). Even though all of the
hydrophilic amino acids were considered at each of the boundary positions, the [...]
between its side-chain carbonyl oxygen and the backbone amide proton of residue 16. [...]
very low identity to any known protein sequence demonstrates the novelty of the FSD-1 sequence [...]
information from any protein motif was used [...].
Asn 14, which is predicted to form a helix N-cap, is among the most conserved surface positions. The strong
sequence conservation observed for critical areas of the molecule suggests that if a representative sequence folds
into the design target structure, then perhaps thousands of sequences whose variations do not disrupt the critical
interactions may be equally competent. Even if billions of sequences would successfully achieve the target fold, they
would represent only a vanishingly small proportion of the 10²⁷ possible sequences.

[0255] Experimental validation. FSD-1 was synthesized in order to characterize its structure and assess the
performance of the design algorithm. The far UV circular dichroism (CD) spectrum of FSD-1 shows minima at 220 nm and
207 nm, which is indicative of a folded structure (data not shown). The thermal melt is weakly cooperative, with an
inflection point at 39 °C, and is completely reversible (data not shown). The broad melt is consistent with a low enthalpy
of folding, which is expected for a motif with a small hydrophobic core. This behavior contrasts with the uncooperative thermal
unfolding transitions observed for other folded short peptides (Scholtz, et al., 1991). FSD-1 is highly soluble (greater
than 3 mM) and equilibrium sedimentation studies at 100 µM, 500 µM and 1 mM show the protein to be monomeric.
The sedimentation data fit well to a single species, monomer model with a molecular mass of 3630 at 1 mM, in good
agreement with the calculated monomer mass of 3488. Also, far UV CD spectra showed no concentration dependence
from 50 µM to 2 mM, and nuclear magnetic resonance (NMR) COSY spectra taken at 100 µM and 2 mM were essentially
identical.

[0256] The solution structure of FSD-1 was solved using homonuclear 2D ¹H NMR spectroscopy (Piantini, et al.,
1982). NMR spectra were well dispersed, indicating an ordered protein structure and easing resonance assignments.
Proton chemical shift assignments were determined with standard homonuclear methods (Wuthrich, 1986).
Unambiguous sequential and short-range NOEs indicate helical secondary structure from residues 15 to 26, in agreement with
the design target.

[0257] The structure of FSD-1 was determined using 284 experimental restraints (10.1 restraints per residue) that
were non-redundant with covalent structure, including 274 NOE distance restraints and 10 hydrogen bond restraints
involving slowly exchanging amide protons. Structure calculations were performed using X-PLOR (Brunger, 1992) with
standard protocols for hybrid distance geometry-simulated annealing (Nilges, et al., FEBS Lett. 229:317 (1988)). An
ensemble of 41 structures converged with good covalent geometry and no distance restraint violations greater than
0.3 Å (Table 9).

Table 9.

NMR structure determination: distance restraints, structural statistics and atomic root-mean-square (rms)
deviations. <SA> are the 41 simulated annealing structures, SA is the average structure before energy minimization,
(SA)r is the restrained energy minimized average structure, and SD is the standard deviation.

*[...] were disordered (φ, ψ angular order parameters (34) < 0.78) and had only sequential and |i - j| = 2 NOEs.

†Nonpolar side chains are from residues 3, 5, 7, 12, 18, 21, 22 and 25, which constitute the core of the protein.

[0258] The backbone of FSD-1 is well defined, with a root-mean-square (rms) deviation from the mean of 0.54 Å
(residues 3-26). Considering the buried side chains (residues 3, 5, 7, 12, 18, 21, 22, and 25) in addition to the backbone
gives an rms deviation of 0.99 Å, indicating that the core of the molecule is well ordered. [...] of the ensemble of
structures was examined using PROCHECK (Laskowski, et al., J. Appl. Crystallogr. 26:283 (1993)). [...] the remainder
in the allowed region of φ, ψ space. Modest heterogeneity is present in the first strand (residues 3-6),
which has an average backbone angular order parameter (Hyberts, et al., 1992) of <S> = 0.96 ± 0.04. [...]
second strand (residues 9-12) with an <S> = 0.98 ± 0.02 and the helix (residues 15-26) with an <S> [...].
Overall, FSD-1 is notably well ordered and, to our knowledge, is the [...]

[0259] The packing pattern of the hydrophobic core of the NMR structure ensemble of FSD-1 (Tyr 3, Ile 7, Phe 12,
[...], Ile 22, and Phe 25) is similar to the computed packing arrangement. Five of the [...] that do not match
their computed χ1 angles are Ile 7 and Phe 25, [...] interactions and instead exposes about 45% of its surface
area because of the displacement of the strand 1 backbone [...]. Conversely, Lys 8 behaves as predicted by the
algorithm, with its solvent exposure (60%) and χ1 and χ2 angles matching the computed structure. Most of the
solvent exposed [...], which precludes examination of the predicted surface residue hydrogen bonds. [...]
from its sidechain carbonyl oxygen as predicted, but to the amide of Glu 17, not Lys 16 as expected from the design.
This hydrogen bond is present in 95% of the structure ensemble and has a donor-acceptor distance of 2.6 ± 0.06 Å.
In general, the side chains of FSD-1 correspond well with the design program predictions.

[0260] A comparison of the average restrained minimized structure of FSD-1 and the design target was done (data
not shown). The overall backbone rms deviation of FSD-1 from the design target is 1.98 Å for residues 3-26 and only
0.98 Å for residues 8-26 (Table 10).

Table 10.

Comparison of the FSD-1 experimentally determined structure and the design target structure. The FSD-1 structure
is the restrained energy minimized average from the NMR structure determination. The design target structure is
the second DNA binding module of the zinc finger Zif268 (9).

Atomic rms deviations (Å)
    Backbone, residues 3-26         1.98
    Backbone, residues 8-26         0.98

Super-secondary structure parameters*
                        FSD-1       Design Target
    h (Å)                9.9             8.9
    θ (degrees)         14.2            16.5
    Ω (degrees)         13.1            13.5

*[...] least-square plane fit to the Cα coordinates of the sheet (residues 3-12). θ is the angle of inclination of the
principal moment of the helix Cα atoms with the plane of the sheet. Ω is the angle between the projection of the
principal moment of the helix onto the sheet and the projection of the average least-square fit line to the strand
Cα coordinates (residues 3-6 and 9-12) onto the sheet.

[0261] The largest difference between FSD-1 and the target structure occurs from residues 4-7, with a displacement
of 3.0-3.5 Å of the backbone atom positions of strand 1. The agreement for strand 2, the strand to helix turn, and the
helix is remarkable, with the differences nearly within the accuracy of the structure determination. For this region of
the structure, the rms difference of φ, ψ angles between FSD-1 and the design target is only 14 ± 9°. In order to
quantitatively assess the similarity of FSD-1 to the global fold of the target, we calculated their supersecondary structure
parameters (Table 10) (Janin & Chothia, J. Mol. Biol. 143:95 (1980); Su & Mayo, Protein Sci. in press, 1997), which
describe the relative orientations of secondary structure units in proteins. The values of θ, the inclination of the helix
relative to the sheet, and Ω, the dihedral angle between the helix axis and the strand axes, are nearly identical. The
height of the helix above the sheet, h, is only 1 Å greater in FSD-1. A study of protein core design as a function of helix
height for Gβ1 variants demonstrated that up to 1.5 Å variation in helix height has little effect on sequence selection
(Su & Mayo, supra, 1997). The comparison of secondary structure parameter values and backbone coordinates
highlights the excellent agreement between the experimentally determined structure of FSD-1 and the design target, and
demonstrates the success of our algorithm at computing a sequence for this ββα motif.
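
The supersecondary structure parameters h, θ and Ω can be illustrated with a small vector-geometry sketch. It assumes that the sheet plane (a point on it and its unit normal), the helix axis and the average strand direction have already been fitted to the Cα coordinates; the function names and the synthetic geometry are illustrative only and are not taken from the cited work.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def project_onto_plane(v, normal):
    """Remove the component of v along the unit plane normal."""
    k = dot(v, normal)
    return tuple(x - k * n for x, n in zip(v, normal))

def helix_height(helix_centroid, plane_point, normal):
    """h: distance from the helix centroid to the sheet plane."""
    return abs(dot(sub(helix_centroid, plane_point), normal))

def inclination_deg(helix_axis, normal):
    """theta: angle between the helix axis and the sheet plane."""
    s = abs(dot(unit(helix_axis), normal))
    return math.degrees(math.asin(min(1.0, s)))

def omega_deg(helix_axis, strand_dir, normal):
    """Omega: angle between the in-plane projections of the helix axis
    and the strand direction (both treated as undirected lines)."""
    a = unit(project_onto_plane(helix_axis, normal))
    b = unit(project_onto_plane(strand_dir, normal))
    return math.degrees(math.acos(min(1.0, abs(dot(a, b)))))

# Synthetic geometry (an assumption) built from the FSD-1 column of Table 10.
normal = (0.0, 0.0, 1.0)                      # sheet lies in the z = 0 plane
theta, omega = math.radians(14.2), math.radians(13.1)
axis = (math.cos(theta) * math.cos(omega),
        math.cos(theta) * math.sin(omega),
        math.sin(theta))
h = helix_height((0.0, 0.0, 9.9), (0.0, 0.0, 0.0), normal)
print(h, inclination_deg(axis, normal), omega_deg(axis, (1.0, 0.0, 0.0), normal))
```

In practice the plane normal and the axes would come from least-squares fits to the Cα atoms, which this sketch deliberately leaves out.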

[0262] The quality of the match between FSD-1 and the design target demonstrates the ability of our program to
design a sequence for a fold that contains the three major secondary structure elements of proteins: sheet, helix, and
turn. Since the ββα fold is different from those used to develop the sequence selection methodology, the design of
FSD-1 represents a successful transfer of our program to a new motif.

Example 6

Calculation of solvent accessible surface area scaling factors 

[0263] In contrast to the previous work, backbone atoms are included in the calculation of surface areas. Thus, the 
calculation of the scaling factors proceeds as follows. 

[0264] The program BIOGRAF (Molecular Simulations Incorporated, San Diego, California) was used to generate
explicit hydrogens on the structures, which were then conjugate gradient minimized for 50 steps using the DREIDING
force field. Surface areas were calculated using the Connolly algorithm with a dot density of 10 Å⁻², a probe
radius of zero, an add-on radius of 1.4 Å and atomic radii from the DREIDING force field. Atoms that contribute to
the hydrophobic surface area are carbon, sulfur and hydrogen atoms attached to carbon and sulfur.
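
The rule for which atoms count toward the hydrophobic surface area can be written as a short predicate. This is a minimal sketch: the atom record layout (`element`, `bonded_to`, `area`) and the example areas are hypothetical, and the per-atom areas are assumed to have been computed beforehand by a surface program.

```python
HYDROPHOBIC_HEAVY = {"C", "S"}

def is_hydrophobic(element, bonded_to=None):
    """Carbon and sulfur always count; hydrogen counts only when it is
    attached to carbon or sulfur."""
    if element in HYDROPHOBIC_HEAVY:
        return True
    if element == "H":
        return bonded_to in HYDROPHOBIC_HEAVY
    return False

def hydrophobic_area(atoms):
    """Sum per-atom surface areas (in square angstroms) over hydrophobic atoms."""
    return sum(a["area"] for a in atoms
               if is_hydrophobic(a["element"], a.get("bonded_to")))

# Hypothetical per-atom exposed areas for a small fragment.
atoms = [
    {"element": "C", "area": 12.0},
    {"element": "H", "bonded_to": "C", "area": 5.0},
    {"element": "O", "area": 9.0},
    {"element": "H", "bonded_to": "N", "area": 4.0},
    {"element": "S", "area": 7.5},
]
print(hydrophobic_area(atoms))
```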
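[0265] For each side-chain rotamer r at residue position i with a local tri-peptide backbone t3, we calculated
$A^0_{i_r t3}$, the exposed area of the rotamer and its backbone in the presence of the local tri-peptide backbone,
and $A_{i_r t}$, the exposed area of the rotamer and its backbone in the presence of the entire template t, which
includes the protein backbone and any side chains not involved in the calculation (Figure 13). The difference between
$A^0_{i_r t3}$ and $A_{i_r t}$ is the total area buried by the template for a rotamer r at residue position i. For each
pair of residue positions i and j and rotamers r and s on i and j, respectively, $A_{i_r j_s t}$ is the exposed area
of the rotamer pair in the presence of the entire template. The difference between $A_{i_r j_s t}$ and the sum of
$A_{i_r t}$ and $A_{j_s t}$ is the area buried between the two rotamers, excluding that area buried by the template.
The pairwise approximation to the total buried surface area is:

Equation 29:

\[ A^{\mathrm{pairwise}}_{\mathrm{buried}} = \sum_i \left( A^0_{i_r t3} - A_{i_r t} \right) + f \sum_{i<j} \left( A_{i_r t} + A_{j_s t} - A_{i_r j_s t} \right) \]

[0266] As shown in Figure 13, the second sum in Equation 29 over-counts the buried area. We have therefore
multiplied the second sum by a scale factor f whose value is to be determined empirically. Expected values of f are
discussed below.

[0267] Noting that the buried and exposed areas should add to the total area, $\sum_i A^0_{i_r t3}$, the
solvent-exposed surface area is:

Equation 30:

\[ A^{\mathrm{pairwise}}_{\mathrm{exposed}} = \sum_i A_{i_r t} - f \sum_{i<j} \left( A_{i_r t} + A_{j_s t} - A_{i_r j_s t} \right) \]

[0268] The first sum of Equation 30 represents the total exposed area of each rotamer in the context of the protein
template, ignoring other rotamers. The second sum subtracts the area buried between rotamers and is scaled by the
same parameter f as in Equation 29.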
[0269] Some insight into the expected value of f can be gained from consideration of a close-packed face centered
cubic lattice of spheres of radius r. When the radii are increased from r to R, the surface area on one sphere buried
by a neighboring sphere is 2πR(R - r). We take r to be a carbon radius (1.95 Å) and R to be 1.4 Å larger. Then using

\[ f = \frac{\text{true buried area}}{\text{pairwise buried area}} \]

and noting that each sphere has 12 neighbors, results in:

\[ f = \frac{4 \pi R^2}{12 \times 2 \pi R \, (R - r)} \]

[0270] This yields f = 0.40. A close-packed face centered cubic lattice has a packing fraction of 74%. Protein interiors
have a similar packing fraction, although because many atoms [...]. [...] should be a lower bound for real protein
cores. For non-core residues, where the packing fraction is lower, a somewhat larger value of f is expected.
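
The close-packed estimate of f can be checked numerically. In a face centered cubic lattice each expanded sphere of radius R has its whole surface 4πR² buried, while the pairwise sum counts 12 spherical caps of area 2πR(R - r), so f = 4πR² / (12 × 2πR(R - r)). The sketch below assumes r = 1.95 Å for a carbon atom and uses the 1.4 Å add-on radius from the surface calculation; the function name is illustrative.

```python
import math

def fcc_scale_factor(r, add_on, neighbors=12):
    """Ratio of the true buried area to the pairwise (over-counted) buried
    area for one sphere in a close-packed lattice."""
    R = r + add_on
    true_buried = 4.0 * math.pi * R ** 2        # entire expanded surface is buried
    cap = 2.0 * math.pi * R * (R - r)           # area buried by one neighbor
    return true_buried / (neighbors * cap)

# r = 1.95 angstroms is an assumed carbon radius; 1.4 angstroms is the add-on radius.
f = fcc_scale_factor(r=1.95, add_on=1.4)
print(round(f, 2))
```

The expression simplifies to R / (6(R - r)), so the estimate depends only on the ratio of the expanded radius to the add-on radius.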

[0271] We classified residues from ten proteins ranging in size from 54 to 289 residues into core or non-core as
follows. We classified residues as core or non-core using an algorithm that considered the direction of each side chain
[...] the surface computed using only the template atoms with a carbon radius of 1.95 Å, a [...] and no add-on
radius. A residue was classified as a core position if both the distance from [...]

Brookhaven Identifier    Total Size    Core Size    Non-Core Size
1enh                          54           10             40
1pga                          56           10             40
1ubi                          76           16             50
1mol                          94           19             61
1kpt                         105           27             60
4azu-A                       128           39             71
1gpr                         158           39             89
1gcs                         174           53             98
1edt                         266           95            133
1pbn                         289           96            143


[0272] The classification into core and non-core was made because core residues interact more strongly with one 
another than do non-core residues. This leads to greater over-counting of the buried surface area for core residues. 
[0273] Considering the core and non-core cases separately, the value of f which most closely reproduced the true 
Lee and Richards surface areas was calculated for the ten proteins. The pairwise approximation very closely matches 
the true buried surface area (data not shown). It also performs very well for the exposed hydrophobic surface area of 
non-core residues (data not shown). The calculation of the exposed surface area of the entire core of a protein involves 
the difference of two large and nearly equal areas and is less accurate; as will be shown, however, when there is a 
mixture of core and non-core residues, a high accuracy can still be achieved. These calculations indicate that for core 
residues f is 0.42 and for non-core residues f is 0.79. 
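
One way to obtain the value of f that most closely reproduces the true areas is a one-parameter least-squares fit. This is a sketch under the assumption that "most closely" means least squares; the source does not state the fitting criterion, and the data below are synthetic. The model is true_buried = template_buried + f × pair_sum, where pair_sum is the unscaled second sum of the pairwise approximation.

```python
def fit_scale_factor(true_buried, template_buried, pair_sums):
    """Least-squares estimate of f in: true = template + f * pair_sum."""
    num = sum(p * (y - t)
              for y, t, p in zip(true_buried, template_buried, pair_sums))
    den = sum(p * p for p in pair_sums)
    if den == 0.0:
        raise ValueError("pair sums are all zero; f is undetermined")
    return num / den

# Synthetic example (hypothetical areas in square angstroms) generated with f = 0.42.
template = [100.0, 250.0, 400.0]
pairs = [50.0, 120.0, 300.0]
true = [t + 0.42 * p for t, p in zip(template, pairs)]
f_fit = fit_scale_factor(true, template, pairs)
print(round(f_fit, 2))
```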

[0274] To test whether the classification of residues into core and non-core was sufficient, we examined subsets of
interacting residues in the core and non-core positions, and compared the true buried area of each subset with that
calculated (using the above values of f). For both subsets of the core and the non-core, the correlation remained high
(R² = 1.00), indicating that no further classification is necessary (data not shown). (Subsets were generated as follows:
given a seed residue, a subset of size two was generated by adding the closest residue; the next closest residue was
added for a subset of size three, and this was repeated up to the size of the protein. Additional subsets were generated
by selecting different seed residues.)
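
The subset generation just described can be sketched as follows. The sketch assumes that "closest" is measured from the seed residue using one representative coordinate per residue (for example the Cα atom); the source does not specify the distance measure, and the coordinates are hypothetical.

```python
import math

def nested_subsets(coords, seed):
    """coords maps residue id -> (x, y, z). Returns the subsets of size
    2..N obtained by adding residues in order of distance from the seed."""
    others = sorted((r for r in coords if r != seed),
                    key=lambda r: math.dist(coords[seed], coords[r]))
    order = [seed] + others
    return [order[:k] for k in range(2, len(order) + 1)]

# Hypothetical coordinates in angstroms.
coords = {1: (0.0, 0.0, 0.0), 2: (5.0, 0.0, 0.0),
          3: (1.0, 0.0, 0.0), 4: (3.0, 0.0, 0.0)}
subsets = nested_subsets(coords, seed=1)
print(subsets)
```

Calling the function again with a different `seed` gives the additional subsets mentioned in the text.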

[0275] It remains to apply this approach to calculating the buried or exposed surface areas of an arbitrary selection
of interacting core and non-core residues in a protein. When a core residue and a non-core residue interact, we replace
Equation 29 with:

Equation 31:

\[ A^{\mathrm{pairwise}}_{\mathrm{buried}} = \sum_i \left( A^0_{i_r t3} - A_{i_r t} \right) + \sum_{i<j} f_{ij} \left( A_{i_r t} + A_{j_s t} - A_{i_r j_s t} \right) \]

and Equation 30 with Equation 32:

\[ A^{\mathrm{pairwise}}_{\mathrm{exposed}} = \sum_i A_{i_r t} - \sum_{i<j} f_{ij} \left( A_{i_r t} + A_{j_s t} - A_{i_r j_s t} \right) \]

where $f_i$ and $f_j$ are the values of f appropriate for residues i and j, respectively, and $f_{ij}$ takes on an
intermediate value. Using subsets from the whole of 1pga, the optimal value of $f_{ij}$ was found to be 0.74. This value
was then shown to be appropriate for other test proteins (data not shown).
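
The pairwise approximation with class-dependent scale factors can be sketched in a few lines. The sketch assumes the form buried = Σ(A0 - At) + Σ f_ij (At_i + At_j - At_ij) and exposed = Σ At - Σ f_ij (At_i + At_j - At_ij), with f = 0.42 for core-core pairs, 0.79 for non-core pairs and 0.74 for mixed pairs as reported above. The dictionary layout and the area values are hypothetical.

```python
from itertools import combinations

F_CORE, F_NONCORE, F_MIXED = 0.42, 0.79, 0.74

def scale_factor(class_i, class_j):
    """Pick f for a residue pair from the two residue classes."""
    if class_i == class_j:
        return F_CORE if class_i == "core" else F_NONCORE
    return F_MIXED

def pairwise_areas(a0, a_t, a_pair_t, classes):
    """a0[i]: exposed area with only the local tri-peptide backbone;
    a_t[i]: exposed area with the full template;
    a_pair_t[(i, j)]: exposed area of the rotamer pair with the template.
    Returns (buried, exposed) under the pairwise approximation."""
    residues = sorted(a0)
    buried = sum(a0[i] - a_t[i] for i in residues)
    exposed = sum(a_t[i] for i in residues)
    for i, j in combinations(residues, 2):
        overlap = a_t[i] + a_t[j] - a_pair_t[(i, j)]
        f_ij = scale_factor(classes[i], classes[j])
        buried += f_ij * overlap
        exposed -= f_ij * overlap
    return buried, exposed

# Hypothetical areas in square angstroms for three residues.
a0 = {1: 200.0, 2: 180.0, 3: 220.0}
a_t = {1: 120.0, 2: 150.0, 3: 90.0}
a_pair_t = {(1, 2): 250.0, (1, 3): 190.0, (2, 3): 230.0}
classes = {1: "core", 2: "noncore", 3: "core"}
buried, exposed = pairwise_areas(a0, a_t, a_pair_t, classes)
print(buried, exposed)
```

By construction the buried and exposed areas add up to the total area Σ A0, which is a convenient sanity check on any implementation.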



Claims 



1. A method executed by a computer under the control of a program, said computer including a memory for storing
said program, said method comprising the steps of:

(A) receiving a protein backbone structure with variable residue positions;

(B) classifying each variable residue position as either a core, surface or boundary residue;

(C) establishing a group of potential rotamers for each of said variable residue positions, wherein at least one
variable residue position has rotamers from at least two different amino acid side chains; and

(D) analyzing the interaction of each of the said rotamers with all or part of the remainder of said protein to
generate a set of optimized protein sequences, wherein said analyzing step includes the use of at least one
scoring function.

2. The method according to claim 1, wherein at least one variable residue position comprises a surface or boundary
residue.

3. The method according to claim 1, wherein said analyzing step comprises a DEE computation.

4. The method according to claim 1, wherein said set of optimized protein sequences comprises the globally optimal
protein sequence.

5. The method according to claim 1, wherein said DEE computation is selected from the group consisting of original
DEE and Goldstein DEE.

6. The method according to claim 1, wherein said scoring function is selected from the group consisting of a Van der
Waals potential scoring function, a hydrogen bond potential scoring function, an atomic solvation scoring function,
an electrostatic scoring function and a secondary structure propensity scoring function.

7. The method according to claim 1, wherein said analyzing step includes the use of at least three scoring functions.

8. The method according to claim 1, wherein said analyzing step includes the use of at least four scoring functions.

9. The method according to claim 1, further comprising testing at least one member of said set to produce experimental
results.

10. The method according to claim 4, further comprising:

(D) generating a rank order list of additional optimal sequences from said globally optimal protein sequence.

11. The method according to claim 10, wherein said generating includes the use of a Monte Carlo search.

12. The method according to claim 1, wherein said analyzing step comprises a Monte Carlo search.

13. The method according to claim 10, further comprising:

(E) testing some or all of said protein sequences from said order list to produce potential energy test results.

14. The method according to claim 13, further comprising:

(F) analyzing the correspondence between said potential energy test results and theoretical potential energy
data.

15. A computer readable memory embodying a program, said program comprising code means that, when they are
executed in a computer, direct:

a side chain module to correlate a group of potential rotamers for residue positions of a protein backbone
model classified as either a core, surface or boundary residue;

a ranking module comprising at least two scoring function components to analyze the interaction of each of
said rotamers with all or part of the remainder of said protein to generate a set of optimized protein sequences.

16. The computer readable memory according to claim 15, wherein said scoring component includes a van der Waals
scoring function.

17. The computer readable memory according to claim 15, wherein said scoring component includes an atomic
solvation scoring function.

18. The computer readable memory according to claim 15, wherein said scoring component includes a hydrogen bond 
scoring function. 

19. The computer readable memory according to claim 15, wherein said scoring component includes a secondary 
structure scoring function. 

20. The computer readable memory according to claim 15, further comprising an assessment module to assess the 
correspondence between potential energy test results and theoretical potential energy data. 



Claims (translated from the German text)

1. A method carried out by a computer under the control of a program, the computer comprising a memory for
storing the program, the method comprising the following steps:

(A) receiving a protein backbone structure with variable residue positions;

(B) classifying each variable residue position as either a core, surface or boundary residue;

(C) establishing a group of potential rotamers for each of the variable residue positions, wherein at least
one variable residue position has rotamers from at least two different amino acid side chains; and

(D) analyzing the interaction of each of the rotamers with all or part of the remainder of the protein to
form a set of optimized protein sequences, wherein the analyzing step comprises the use of at least one
scoring function.

2. The method according to claim 1, wherein at least one variable residue position comprises a surface or
boundary residue.

3. The method according to claim 1, wherein the analyzing step comprises a DEE computation.

4. The method according to claim 1, wherein the set of optimized protein sequences comprises the globally optimal
protein sequence.

5. The method according to claim 1, wherein the DEE computation is selected from the group consisting of original
DEE and Goldstein DEE.

6. The method according to claim 1, wherein the scoring function is selected from the group consisting of a Van der
Waals potential scoring function, a hydrogen bond potential scoring function, an atomic solvation scoring
function, an electrostatic scoring function and a secondary structure propensity scoring function.

7. The method according to claim 1, wherein the analyzing step comprises the use of at least three scoring functions.

8. The method according to claim 1, wherein the analyzing step comprises the use of at least four scoring functions.

9. The method according to claim 1, further comprising testing at least one member of the set to obtain
experimental results.

10. The method according to claim 4, further comprising:

(D) generating a rank order list of further optimal sequences from the globally optimal protein sequence.

11. The method according to claim 10, wherein the generating comprises the use of a Monte Carlo search.

12. The method according to claim 1, wherein the analyzing step comprises a Monte Carlo search.

13. The method according to claim 10, further comprising:

(E) testing some or all of the protein sequences from the rank order list to produce potential energy test
results.

14. The method according to claim 13, further comprising:

(F) analyzing the correspondence between the potential energy test results and the theoretical potential
energy data.

15. A computer readable memory embodying a program, the program comprising code means that, when they are
executed in a computer, direct:

a side chain module to correlate a group of potential rotamers for residue positions of a protein backbone
model that are classified as either core, surface or boundary residues;

a ranking module comprising at least two scoring function components to analyze the interaction of each of
the rotamers with all or part of the remainder of the protein to generate a set of optimized protein sequences.

16. The computer readable memory according to claim 15, wherein the scoring component comprises a Van der Waals
scoring function.

17. The computer readable memory according to claim 15, wherein the scoring component comprises an atomic
solvation scoring function.

18. The computer readable memory according to claim 15, wherein the scoring component comprises a hydrogen bond
scoring function.

19. The computer readable memory according to claim 15, wherein the scoring component comprises a secondary
structure scoring function.

20. The computer readable memory according to claim 15, further comprising an assessment module for assessing the
correspondence between potential energy test results and theoretical potential energy data.


Claims (translated from the French text)

1. A method executed by a computer under the control of a program, said computer comprising a memory
for storing said program, said method comprising the steps of:

(A) receiving a protein backbone structure with variable residue positions;

(B) classifying each variable residue position as a core, surface or boundary residue;

(C) establishing a group of potential rotamers for each of the variable residue positions, wherein at least
one variable residue position has rotamers from at least two different amino acid side chains; and

(D) analyzing the interaction of each of said rotamers with all or part of the remainder of said protein to
generate a group of optimized protein sequences, wherein said analyzing step comprises the use of at least
one scoring function.

2. The method according to claim 1, wherein at least one variable residue position comprises a surface or
boundary residue.

3. The method according to claim 1, wherein said analyzing step comprises a DEE computation.

4. The method according to claim 1, wherein said group of optimized protein sequences comprises the globally
optimal protein sequence.

5. The method according to claim 1, wherein said DEE computation is selected from the group consisting of original
DEE and Goldstein DEE.

6. The method according to claim 1, wherein said scoring function is selected from the group consisting of a
Van der Waals potential scoring function, a hydrogen bond potential scoring function, an atomic solvation
scoring function, an electrostatic scoring function and a secondary structure propensity scoring function.

7. The method according to claim 1, wherein said analyzing step comprises the use of at least three scoring
functions.

8. The method according to claim 1, wherein said analyzing step comprises the use of at least four scoring
functions.

9. The method according to claim 1, further comprising testing at least one member of said group to produce
experimental results.

10. The method according to claim 4, further comprising:

(D) producing a rank order list of additional optimal sequences from said globally optimal protein sequence.

11. The method according to claim 10, wherein said producing comprises the use of a Monte Carlo search.

12. The method according to claim 1, wherein said analyzing step comprises a Monte Carlo search.

13. The method according to claim 10, further comprising:

(E) testing some or all of the protein sequences from said order list to produce potential energy test
results.

14. The method according to claim 13, further comprising:

(F) analyzing the correspondence between the potential energy test results and the theoretical potential
energy data.

15. A computer readable memory embodying a program, said program comprising code means that, when they are
executed by a computer, direct:

a side chain module to correlate a group of potential rotamers for the residue positions of a protein
backbone model classified as a core, surface or boundary residue;

a ranking module comprising at least two scoring function components to analyze the interaction of each
of said rotamers with all or part of the remainder of said protein to generate a group of optimized
protein sequences.

16. The computer readable memory according to claim 15, wherein said scoring component comprises a Van der Waals
scoring function.

17. The computer readable memory according to claim 15, wherein said scoring component comprises an atomic
solvation scoring function.

18. The computer readable memory according to claim 15, wherein said scoring component comprises a hydrogen bond
scoring function.

19. The computer readable memory according to claim 15, wherein said scoring component comprises a secondary
structure scoring function.

20. The computer readable memory according to claim 15, further comprising an assessment module for assessing
the correspondence between the potential energy test results and the theoretical potential energy data.